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Abstract 

This paper studies the problem of estimating the covariance of a collection of vectors using only 
extremely compressed measurements of each vector. An estimator based on back-projections of these 
compressive samples is proposed and analyzed. A distribution-free analysis shows that by observing just 
a single compressive measurement of each vector, one can consistently estimate the covariance matrix, in 
both infinity and spectral norm, and this same analysis leads to precise rates of convergence in both norms. 
Via information-theoretic techniques, lower bounds showing that this estimator is minimax-optimal for 
both infinity and spectral norm estimation problems are established. These results are also specialized to 
give matching upper and lower bounds for estimating the population covariance of a collection of Gaussian 
vectors, again in the compressive measurement model. The analysis conducted in this paper shows that the 
effective sample complexity for this problem is scaled by a factor of m 2 /d 2 where m is the compression 
dimension and d is the ambient dimension. Applications to subspace learning (Principal Components 
Analysis) and learning over distributed sensor networks are also discussed. 


1 Introduction 


Covariance matrices provide second-order information between a collection of random variables and play a 
fundamental role in statistics and signal processing. Concrete examples include dimensionality reduction, 
where covariance information is a sufficient statistic for the widely used Principal Components Analysis 
(PCA), and linear discriminant analysis, a popular classification method. An important statistical task is 
covariance estimation, where the goal is to recover the population covariance matrix of a distribution, given 
independent and identically distributed samples. 

In this paper, we study a variant of the covariance estimation problem, where the samples are ob¬ 
served only through low-dimensional random projections. This estimation problem has roots in compressed 
sensing, where random projections have been used to reduce measurement overhead associated with high¬ 
dimensional signals. It is also motivated by problems in learning over distributed sensor networks, where 
both power and communication constraints may limit the measurement capabilities of a single sensor. We 


describe this application in more detail in Section 4.3 
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In the first part of the paper, we propose and analyze a covariance estimator based on these low¬ 
dimensional, compressed, observations. We specifically consider a model where an independent random 
projection is used for each data vector. We show that even when each vector is observed only via projection 
onto a one-dimensional subspace (i.e., one linear measurement), one can consistently and accurately esti¬ 
mate the sample covariance matrix of the data vectors in both spectral and infinity norms. In our analysis, 
we make no distributional assumptions on the data vectors themselves and attempt to recover the sample 
covariance, since no population covariance exists. We present a specialization of this distribution-free anal¬ 
ysis to the case where the data vectors are drawn from a Gaussian distribution and the goal is to estimate 
the population covariance. As two motivating applications of our analysis, we give guarantees for subspace 
learning (or Principal Components Analysis) and for learning over distributed sensor networks. 

In the second part of the paper, we consider the fundamental limits of this estimation problem. Using 
information-theoretic tools, we derive lower bounds for a variety of settings, including the distributional and 
distribution-free settings under which we analyze our estimator. This analysis reveals that our covariance 
estimator nearly achieves the minimax-rate for this problem, meaning that modulo constants and logarithmic 
factors, our estimator is the best one can hope for. We also consider an alternative popular measurement 
paradigm, where a single low-dimensional random projection is used for all data vectors and show that this 
approach is inconsistent for the covariance estimation problem. 

Our work deviates from the majority of work on compressive estimation in that we do not make structural 
assumptions on the estimand, in this case the target covariance. A number of papers assume that the target 
covariance is low rank 00, sparse m, or that the inverse covariance is sparse Gum. The broad theme 
of this line of work is that when the target covariance has some low-dimensional structure, far fewer total 
measurements (via random projection) are necessary to achieve the same error as direct observation in the 
unstructured case. However when the target covariance does not have low-dimensional structure, these 
methods can fail dramatically, as we show with our lower bounds. 

In contrast, our work instead examines the statistical price one pays for compressing the data vectors 
when the covariance matrix does not exhibit any low dimensional structure. In the unstructured setting, 
compressing the data requires that one use significantly more measurements to achieve comparable level of 
accuracy to the fully observed setting. We precisely quantify this increase in measurement, showing that 
the effective sample size shifts from n to nm 2 /d 2 , where the projection dimension is m and the ambient 
dimension is d. Since we must have m < d, this means that one needs more samples to achieve a specified 
accuracy under our measurement model, in comparison with direct observation. This effective sample size 
is present in all of our upper and lower bounds, showing that indeed, there is a price to pay for compression 
without structural assumptions. Note that this quadratic growth in effective sample size also matches recent 
results on covariance estimation from missing data mm. 

While our focus is on the unstructured case, we do show that our estimator can adapt to structure present 
in the problem. Specifically, in the case where the data vectors lie on a fc-dimensional subspace, the error 
bounds for our estimator match those of other approaches that specialize to this low rank setting mm. Thus, 
the simple estimator we introduce here addresses both structured and unstructured covariance estimation 
tasks. 

Regarding proof techniques, our upper bounds are based on analysis of a carefully constructed unbiased 
estimator for the target covariance matrix. The natural estimator for this problem is asymptotically biased 
and hence inconsistent, and by exploiting properties of the Beta distributions that arises from random pro¬ 
jections, we analytically de-bias this natural estimator. To obtain error bounds for this estimator, we use 
concentration-of-measure arguments. The challenge in this part of the proof is that the relevant random vari¬ 
ables have very large range even though their tails decay quite favorably; consequently, a straightforward 
application of a Bernstein-type inequality is too pessimistic. In the case we avoid this issue with a con- 
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ditioning argument, first showing that the random variables have much smaller range with high probability 
and then applying a Bernstein-type inequality conditioned on this event. In the spectral norm case, we use a 
more powerful deviation bound (The Subexponential Matrix Bernstein inequality ll26l ) that exploits sharper 
decay on all moments of the relevant random variables. 

For the lower bounds, our main technical contribution is a strong data processing inequality EJ ED [22] 
which upper bounds the Kullback-Leibler divergence between two compressed Gaussian distributions by a 
small (less than one) multiple of the KL-divergence before compression. This contraction in KL-divergence, 
in concert with a standard approach for establishing minimax lower bounds known as Fano’s method, gives 
the lower bounds in this paper. 

The remainder of this paper is organized as follows: We conclude this section with a formal specification 
of the covariance estimation problem and the observation model. In Section [2] we mention related results 
on covariance estimation and matrix approximation. In Section[3] we develop our covariance estimator, pro¬ 
viding a theoretical analysis in Section[4] Section [4] also contains some simulations results and a discussion 
of applications to subspace learning and learning in distributed sensor networks. We present all of our lower 
bounds in Section[5] All proofs of our theorems are in Section[6]with a brief discussion in Section[7] Several 
technical lemmas are deferred to the appendices. 

1.1 Setup 

Let Xi, .. ., x n be a collection of vectors in and define the covariance E = - Yli= i x i x f ■ We make 
no distributional assumptions on the sequence {xi}" =1 and therefore aim to recover the sample covariance 
E, since there is no well-defined population version. In particular, the sequence could be adversarially 
generated. Let A' £ R dxn be a matrix whose /th column is the data vector x t . When we specialize to the 
distributional setting, we will assume that the sequence x\,...,x n ~ Af( 0, E), where E is the population 
covariance. Whether E refers to the sample covariance in the distribution-free setting or the population 
covariance in the distributional setting will be clear from context. 

Independently for all t, let A t £ R' ixm be an orthonormal basis for an m-dimensional subspace drawn 
uniformly at random. We are interested in estimating E from the observations {(A t , Afxt)}™—i, so that 
each vector is compressed from d dimensions down to m dimensions. Note that this measurement scheme is 
equivalent, in an information-theoretic sense, to drawing m-dimensional orthogonal projections <E> t £ W lxd 
uniformly at random, and independently for all t, and observing {('hi, dbxV)}" =1 • This equivalence can be 
easily seen by noting that the matrix A t Af is a uniform-at-random m-dimensional orthogonal projection, 
while a uniform-at-random orthonormal basis for the subspace encoded by <1>, has the same distribution as 
At. In both cases the vectors x t have been compressed down to m dimensions. 

As terminology, we use the phrases “data sequence” and “samples” to denote the vectors x \,... ,x n , 
which we emphasize are only observed via compression. We use “observations” for the equivalent repre¬ 
sentations (At, AjXt) and ('\> t . We reserve the word “measurements” for the linear operators A t or 

<f> t , which act on the data sequences to produce the observations. The term “sample complexity” refers to 
the number of samples n as a function of the parameters m and d required to achieve a desired error for a 
particular taslQ 

We use several standard matrix and vector norms throughout the paper. For a matrix M, let | M \ \ p = 
d enote the Frobenius norm and let H-Mjloo = max, j \M,j | denote the element-wise infinity 
norm. We also use ||M|| Pi9 to denote the l p , q mixed norm, which is l q norm of the £ p norms of the columns 

technically, sample complexity refers to the number of samples n as a function of m, d , e, and S that suffice to achieve error e with 
probability at least 1 — 5 for a particular estimation task, but we often use this phrase loosely and suppress dependence on e and S. 
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of the matrix. For example, ||M|| 2 ,oo = maxj \\rrij H 2 for a matrix with columns m 3 -, and this specific norm 
appears several times in our analysis. For a symmetric matrix M £ R. dxd let ||M ||2 = max x .|| x |i 2= i x T Mx 
denote the spectral or operator norm. For a vector v, 11 v 11 2 , ||v||oo are the Euclidean and inhnity norm, 
respectively. Lastly, we use the standard Big-O and Little-O notation for asymptotic characterizations, where 
O and Cl suppress dependence on logarithmic factors. 

2 Related Work 

Estimating a covariance matrix from samples is a classical problem in statistics with applications across 
the spectrum of scientific disciplines. Recent work has focused on the high-dimensional setting, where the 
dimensionality of the data points is large relative to the number of samples. In this setting, a number of 
structural assumptions that lead to tractable estimators have been proposed and studied. 

One approach to these estimation problems is through compressive sensing mom, where the data is 
observed through low-dimensional random projections from which the estimand can be algorithmically re¬ 
covered. The main insight behind compressed sensing is that one can exactly solve underdetermined linear 
systems, provided that the parameter vector exhibits some structure, which was classically sparsity. These 
ideas have been extended to the matrix setting, where a number of papers study matrix recovery and co- 
variance estimation from compressive measurements 000. These results all make strong structural 
assumptions about the matrix of interest, namely that it is low rank, approximately low rank, or sparse. Our 
work deviates from these results in that we make no structural assumptions about the underlying covariance 
matrix and still aim for consistent recovery. 

For example, Chen et al. a study the low-rank matrix recovery problem, where the matrix M is observed 
through noisy quadratic measurements of the form af Mai + £■ Cai and Zhang 0 obtain similar results but 
also consider the spiked covariance model in which the covariance matrix E = I + M and M is low rank. 
Specifically, they consider a setting where data vectors x, are drawn from a Gaussian distribution with spiked 
covariance but only ( a,i,a[xi ) is observed. Finally, Dasarathy et al. 0 consider compressive covariance 
estimation under a sparsity assumption. All of these works make strong structural assumptions on the matrix 
which enables both (a) lower sample complexity guarantees and (b) the use of a shared compression operator 
across columns. Our setting is completely unstructured, and thus the existing results do not apply. This 
unstructured problem is statistically harder, and our minimax lower bounds certify both that these low sample 
complexities are not achievable and that the shared compression approach is inconsistent. Nevertheless, we 
do note that in the presence of structural assumptions, our estimator does achieve statistical error that is 
comparable with these more specialized approaches (See Corollary |5J. 

To our knowledge, the only other results that do not require structural assumptions are those of Pourkamali- 
Anarki and Hughes EHl and our previous work fl6l . As in the present work, Pourkamali-Anarki and 
Hughes l2ll study the covariance estimation problem when the data is observed only through low-dimensional 
linear measurements. While their main interest is in using sparse measurement matrices to study a computational- 
statistical tradeoff for this problem, in the dense case most similar to ours, their estimator can be used for 
consistent estimation of the principal components, although it is biased for the target covariance and they do 
not characterize the rate of convergence. In contrast, our work studies an unbiased estimator for the target 
covariance, and our main interest is in precisely characterizing the estimation error for the covariance esti¬ 
mation task. Additionally, our work makes much weaker assumptions on the data (specifically they make 
distributional assumptions while we operate in an adversarial setting) while their work considers a broader 
class of linear measurements, although their class does not contain the random projections we study. 

Our previous work m focuses on the compressive subspace learning problem but does provide guar- 
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antees for covariance estimation. There are, however, two main shortcomings that we resolve in the present 
paper. First, the estimator there is based on a version of data splitting, and, consequently, it does not allow 
for the m = 1 case as we do here. Secondly, the rates of convergence are worse in our previous work, 
whereas we show that the rates derived in this paper are minimax optimal. 

In the theoretical computer science and numerical linear algebra literature, there are several works that 
use random projections for the purposes of fast approximation to the singular value decomposition of an 
unstructured matrix, cf. mmm In this matrix approximation problem, we fully observe a data matrix 
X £ R. dxn , and the goal is to quickly output a rank k matrix X with the guarantee that \\X — X < 
f (11 X — Xk 11) for some norm (usually spectral or Frobenius) and some function /, where A’/. is the best rank- 
k approximation to X. The main differences with our setting are (a) that the data matrix is fully observed, 
(b) the low-rank versus unstructured nature, and (c) that the performance measure is on the data matrix X 
rather than the covariance E. Consequently, state-of-the-art matrix approximation algorithms provably fail 
at covariance estimation (as we show), while the procedure we develop for covariance estimation does not 
achieve state-of-art matrix approximation performance. 

Despite these differences, it is worth briefly discussing algorithmic ideas in the matrix approximation 
literature. For concreteness, consider the algorithm of Halko, Martinsson, and Tropp fl4l . which is a rep¬ 
resentative example from this line of research. Their algorithm first right-multiplies X by a small random 
matrix R to obtain a matrix Y = XR that approximates the column space of A', and then it projects the 
columns of A' onto the subspace spanned by Y to obtain the estimator X = YY r X. This algorithm both 
pre- and post-multiplies the matrix A, which amounts to obtaining compressive measurements of both the 
rows and columns of the data. This is not possible in our setting, where the data sequence is observed only 
through compressive measurements, and therefore these algorithms do not address our problem. 

Another closely related line of work focuses on matrix recovery from missing data. While the majority 
of the results here focus on low rank matrices or other structured settings Buna, there have been recent 
results focusing explicitly on the covariance estimation problem in the unstructured setting. For example, 
Kolar and Xing im consider a setting where each coordinate of each data vector is missing with probability 
1 — a, independently from other coordinates and vectors. They show that one can estimate the covariance 
matrix in £^ norm with 0(na 2 ) samples and use this estimator to learn the structure of a Gaussian graphical 
model. Gonen et al. fl3l study the subspace learning problem in the missing data setting, while Loh and 
Wainwright lfl8l consider high-dimensional linear regression with missing data. Both works show that the 
sample complexity is reduced by a factor related to the squared fraction of entries observed per column. 
While the measurement model in our work is different, qualitatively our main result is similar to these; since 
rri/d = a, we show a similar reduction in effective sample complexity when the data is observed via random 
projection. 

While the statistical behavior is similar in these two settings, obtaining sharp error bounds for the com¬ 
pressed case is analytically much more challenging than the missing data one. In the missing data case, it is 
common (in fact, necessary) to assume that the data vectors are incoherent or have small norm relative 
to their £2 norm (The typical assumption is ||^o — f° r some constant fi Il24l ). One can obtain 

error bounds for uniform-at-random coordinate sampling under this assumption simply by application of 
a Bernstein-type inequality, as all of the relevant random variables have range and variance on the order 
of ^||x t ||| if m measurements per sample are obtained. On the other hand, the measurement process in 
the compressive model leads to random variables with range as large as ||xt|| 2 , so that a naive application 
of these concentration inequalities yields very weak error bounds. Our analysis therefore uses two more 
sophisticated approaches: for the ^-norm bound we use a two-stage conditioning argument, and for the 
spectral norm bound we use a more refined deviation bound that exploits the sharper tail decay of our random 
variables. 
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3 The Covariance Estimator 


In this section, we develop our covariance estimator from compressed observations. The estimator is based 
on an adjustment to the observed covariance, i.e. the covariance of the observations <f>t x t- This adjustment 
is motivated by a characterization of the bias of the observed covariance. 

Specifically, let Si = Y^t=i(^t x t)(&t x t) T be a rescaled version of the observed covariance. Ei 
is an intuitive estimator for S, but, as we will see, it is biased even as n —> oo, meaning that Si is not a 
consistent estimator for S. Instead, our estimate for the sample covariance S is, 

to ((d + 2)(d — l)Si — (d — m) tr(Si) Id) 

v _ V _ L m 

d(dm + d— 2 ) 


This estimator is a de-biased version of Si. 

The specification of E is motivated by the following proposition, which analytically characterizes the 
bias of Si based on properties of the Beta distribution (See Fact 


111 . 


Proposition 1 (De-biasing). Let S = T ^ t=i x t x T andt, ! = {®t x t)($t x t) T ■ Then, 


jgg, _ d{dm + d - 2)E + d(d - to) tr(E)/ d ) 
1 m(d + 2 )(d — 1) 


( 2 ) 


Proof. See Section [672] □ 

With this expansion of the bias, it is also easy to see that tr(EEi) = — tr(E). Substituting in for tr(E) 
and re-arranging, we see that. 


S = 


m d + 2 )(d — l)ESi — (d — to) tr(EE 1 )/ £ ;^ 
d(dm + d — 2) 


Since trace is a linear operator, we immediately see that our estimator is unbiased for E. 


(3) 


4 Upper Bounds and Consequences 

We now turn to our analysis of the estimator E. In this section, we upper bound the error in both spectral 
and foo norms for E in the distribution-free setting. We also specialize these results to the problem of 
estimating the population covariance of a collection of Gaussian vectors. Lastly, we present a brief numerical 
simulations and discuss applications to subspace learning and learning in distributed sensor networks. 

Our first two theorems give error bounds for our estimator, in entry-wise and spectral norm, respec¬ 
tively. 

Theorem 2 ( l ^ Upper Bound). Let d > 2 and 5 £ (0,1) such that S > 4 d 2 exp (—n/12). There exist 
universal constants fti, K 2 > 0 such that with probability at least 1 — <5, 


IS — Slloo < Kill XI 


Id 2 log 2 ('f) 


nm* 


+ K2P1&, 


d 2 log 2 (f) 


nm* 
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Theorem 3 (Spectral Upper Bound). Let d > 2 and define, 

1 


S, = 


J2\\ x t\\l x t^ 


t =i 


S 2 = -£ \\x t \\i 

n L ^ 


t= 1 


There exists universal constants k\, k 2 >0 such that for any 5 £ (0,1), with probability at least 1 — 8 , 

llog(d/ 8 ) d\\X\\l c 



«2- 


log (d/ 8 ). 


We defer the proofs of both theorems to Section[6] Note that both theorems are distribution-free; we make 
no assumptions on the data. In particular, the data sequence can be adversarially generated, but observe that 
some sequence-dependent quantities do appear in the deviation bounds. In both theorems, the condition on 
the dimension d leads to a much cleaner statement of the result but can be relaxed slightl\f] Similarly, in 
Theorem[2] the assumption on 8 are very mild as the lower bound is decaying exponentially. 

For Theorem[3] notice that in the worst case Si < ||X||| ^||HI|| 2 and S 2 < d||Xooll^lb- With both 
of these bounds and when n is sufficiently large, the error guaranteed by the theorem has a leading term 


of order O 


d 2 11X112. 


. We leave the dependence on S\ and S-> in the theorem statement because 


much sharper bounds on these two quantities are often possible. In particular, we will show that when the 
target covariance is low rank, one can obtain a much better spectral norm error bound. 

To compare with existing work, it is best to specialize to the case where the data vectors come from a 
Gaussian distribution. Using standard tail bounds on ||X|| 2 ,oo and ||W]|oo yields the following corollary: 

Corollary 4 (Gaussian Upper Bounds). Let xi,...,x n ~ 7V"(0, E) and construct E as in Equation^ Then 
for any 8 £ (0,1), there exist universal constants c, Ki, k 2 , K 3 > 0 such that, with probability at least 1 — cS, 


IE-eil, <«!||S| 


|E-E||2<« 3 ||E|| 


Id 2 \og 6 (nd/ 8 ) | /log (d/ 8 ) | 

n f 


+ k 2 ||E|| 


d 2 log 3 (nd/ 8 ) 


' d 3 log 2 (nd/5) ^ d 3 log 2 (nd/ 8 ) ^ /log(2 d/ 8 ) | 


nm 2 


nm 2 


r 


(4) 


(5) 


The first bound holds when d > 2 and 8 > d 2 exp(—n/12), while the second bound holds whenever d > 2. 
Here we make several remarks: 

1. When n is large relative to d 2 /m 2 and ignoring logarithmic factors, the leading terms in the error 
bounds are O 


d 2 

nm 2 


in norm and O 


d 3 

nm 2 


in spectral norm. In comparison, to estimate 


the population covariance of a Gaussian distribution in the fully observed setting, it is well known that 
the sample covariance achieves rates O an d O (^J^j infinity and spectral norm respec¬ 

tively li29l . Thus, the effective sample size shrinks from n to nm 2 /d 2 in the compressed setting. 


2 A more precise result can be obtained by examination of our proofs. 
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Figure 1: Rates of convergence for our compressed covariance estimator alongside rescaled rates for infinity 
and spectral norm and different settings of d , to. Rescaling factor is and \j for infinity norm 

(left two panels) and spectral norm (right two panels), respectively. 


2. Apart from our previous work d, this is the first estimator with such strong guarantees in the com¬ 
pressed setting. As we mentioned, most existing work focuses on recovery under strong structural 
assumptions of the population covariance, for example low rank 00 or spiked covariance 0. 

3. Nevertheless, we can show that our estimator adapts to low dimensional structures exhibited by the 
target covariance. In the Gaussian case, if the target covariance has rank at most fc, then an esti¬ 
mator E/, : formed by zero-ing out all but the largest k eigenvalues of E has a much more favorable 
convergence rate: 

Corollary 5. Consider the same setting as Corollary^with d > 2, but further assume that rankfS) < 
k. Then, for any S £ (0,1), with probability 1 — 5, 

,.a I dk dk 2 dk \ , 

— ^II2 < K S 2 I \ -1-2 ^-) ( nc V^), 

\ V nm nm z nm I 

for a universal constant k > 0. 


This is the low-rank version of the spectral norm bound in Corollary [4j the only additional assumption 
is that the target covariance has rank at most k. The bound shows that if n = 0(d), we may set 
to = 0 (k log 2 (d)/e 2 ) to achieve spectral norm error e, which agrees with the sample complexity 
bounds in recent results 0[3li24l. 


4. 


In comparison with our previous work OH, the results here are significantly more refined. First, our 
previous work used a data-splitting technique to avoid the bias demonstrated in Proposition [T] and 
consequently did not address the to = 1 case as we do here. Secondly, in terms of rates, the results 
here are actually sharper. In Krishnamurthy et al. Id, the rate of convergence in spectral norm for 


the full rank Gaussian case is O 



d , 3 

nm 2 


which is polynomially worse than our O 



bound here. As we will see in Theorem |8]below, this latter rate is minimax optimal. 


The proof of these results are based on showing that S-| concentrates sharply around its mean. For both 
results, a crude application of exponential deviation bounds does not suffice, as the relevant random variables 
have large range, although the tails decay quite quickly. We therefore use a more refined analysis to exploit 
this sharp tail decay. For the infinity norm bound, we use a conditioning argument where we first provide a 
probabilistic bound on the range of the random variables and then apply the Bernstein inequality conditioned 
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on this event. For the spectral norm bound of Theorem[3] we instead use properties of Beta random variables 
to upper bound all of the moments in terms of the quantities S\ and S 2 and then apply the Subexponential 
Matrix Bernstein inequality ll26l . The corollaries are based on using well-known Gaussian concentration 
inequalities to bound the sample-dependent quantities in the theorems. 

4.1 Synthetic Experiments 

In Figure [I] we empirically validate the error bounds in Corollary [4] by recording the infinity and spectral 
norm error of our estimator across several problem parameters. In the left two panels, the data is drawn 
from a multivariate Gaussian distribution whose covariance matrix has unit infinity norm, while in the right 
two panels, the data is normally distributed and the covariance matrix has unit spectral norm. We plot the 
infinity norm error (left most) as a function of the number of samples alongside the rescaled infinity norm 

error ||E — d (second from left). Similarly, in the right two panels, we plot the spectral norm 

error (second from right) alongside the rescaled error ||E — S ||2 \J^£~ (right most). 

The experiment confirms Corollary [4j namely that our covariance estimator enjoys an error bound that 
decays rapidly with n both in infinity and spectral norm. In addition, the curves in the second and fourth 
plot validate the error bounds in two ways. First, the fact that the curves in the second and fourth plots are 
flat validates that we have accurately captured the dependence between the error and the number of samples 
n, confirming the n -1 / 2 convergence rate. Secondly, the fact that these curves are tightly clustered suggests 
that we have also captured the dependence on d and m so that, modulo logarithmic factors, our bounds are 
sharp for this estimator. 

We also compare our approach with an algorithm that uses the same random projection for each vector. 
There are many algorithms of this form in a an, and, as a representative example, we use the algorithm 
of Halko, Martinsson, and Tropp fl4l . which we described in Section[2] Recall that this algorithm operates 
on both the rows and the columns of the matrix, so cannot actually be deployed in applications where the the 
data vectors (columns) are observed in a compressed fashion. We emphasize that this algorithm is designed 
for low-rank matrix approximation rather than covariance estimation and that the objectives in these tasks are 
considerably different. This comparison is more a demonstration that a shared compression operator is not 
suitable for unstructured covariance estimation than a criticism of their algorithm, or of matrix approximation 
more broadly. 

In Figure [2] we plot the spectral norm error of our approach (called CSL) and the algorithm of Halko, 
Martinsson, and Tropp (called HMT) as a function of the number of samples n. The data is drawn from a 
40-dimensional multivariate normal distribution whose covariance matrix has unit spectral norm. The main 
takeaway here is that our approach is consistent as n increases with d, m fixed while the HMT algorithm 
is not. Our approach is consistent because it uses a different random projection for each data vector and 
averages across these vectors, so it obtains estimates for all directions of the target covariance. In contrast, 
the HMT algorithm uses the same (data-dependent) m-dimensional projection and consequently only m <C d 
directions are ever observed. 

4.2 Guarantees for Subspace Learning 

In the subspace learning problem, the goal is to estimate the principal components of the data, which amounts 
to the leading eigenvectors of the covariance matrix. Specifically, let the covariance matrix E have eigen- 
decomposition Yli =1 A iVivf with Ai > ... > Xd > 0 and let 14 £ M. dxk be a matrix whose columns are 
the leading k eigenvectors (i.e. vi,..., vt)- The goal of subspace learning is to recover a projection matrix 
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Figure 2: Comparison of our approach (CSL) against the algorithm of Halko, Martinsson, and Tropp 
(HMT) fl4l for a 40-dimensional covariance estimation task. HMT is not consistent as n increases, while 
CSL is. This agrees with both our upper (Corollary [4} and lower (Proposition[I()|) bounds. 


n g ydxd t j lat j s c i ose t 0 i j t = y k \/' r in spectral norm. Recall that the spectral norm difference between 
two projection matrices is the sine of the largest principal angle between the associated subspaces. 

The usual approach to subspace learning is to first construct an estimate E for the covariance and use the 
leading eigenvectors of E as the subspace estimate. In our compressive setting, we form E via Equation [l] 
compute the eigendecomposition E = YL'!= \ and let fl be the projection onto v -\..... £ 7 . 

To describe our theoretical guarantee, we require the standard notion of signal strength for subspace 
learning, namely the eigengap 7/- = A& — A;, :+ | . If 7*. is large, then the principal subspace is well separated 
from the remaining directions, whereas if 7/. is zero, then the principal subspace is actually unidentifiable. 
Incorporating this signal strength into our estimation error bounds immediately implies the following result 
on subspace learning from compressive measurements. For clarity, we present this result in the Gaussian 
setting. 

Corollary 6 (Subspace Learning). Let X t ,..., X n ~ A/"(0, E) and consider the compressive sampling 
model under the assumptions in Corollary [4] For any 5 £ (0,1), with probability at least 1 — 8, 


log (2 d/8) 
n 

This error bound is a consequence of Corollary [4] followed by the celebrated Davis-Kahan theorem (5) 
characterizing how perturbing a matrix affects the eigenvectors. The only other result for this specific prob¬ 
lem is our previous work ED, which, as we mentioned, uses a weaker guarantee for covariance estimation, 
and hence gives a weaker bound. Other results for subspace learning in different measurement settings are 
similar in spirit to the one here lfl3l . 

4.3 Consequences for Distributed Covariance Estimation 

In distributed sensor networks, one is often tasked with performing statistical analysis under both measure¬ 
ment and communication constraints. In the distributed covariance estimation problem, the data vectors 
Xi,..., x n are observed at n sensors s±,... ,s n (i.e. sensor s* observes sample 27 ), and we would like to 
estimate the covariance structure of the vectors while incurring minimal measurement and communication 
overhead. Applications include environmental and atmospheric monitoring, where each dimension of the 


« 2 ||S || 2 f I d 3 log 2 (nd/8) | d 3 log 2 {nd/8) + 
7 fc 1 y nm 2 nm 2 
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data vectors is associated with a particular chemical, so that each sensor records the chemical concentrations 
at a particular location in the environment/atmosphere and the goal is to understand correlations between 
these concentrations. Typically, communication is with a fusion center that aggregates the measurements 
from all of the sensors and performs any additional computation. Our approach provides a low-cost solution 
to these problems. 

In our approach, each sensor s t makes m compressive measurements of the signal Xt, computes the 
back-projection &tXt and sends this to the fusion center. This approach has measurement cost O(nrn) 
and communication cost 0(nd) as d-dimensional vectors must be transmitted to the fusion center. If the 
orthonormal bases A t used for sensing are synchronized with the fusion center before data acquisition, then 
the sensors can instead transmit Ajx t , which would result in 0{nm) communication cost. As we saw, 
the error depends on the effective sample size nm 2 /d 2 so a practitioner can adjust rn to tradeoff between 
measurement overhead and statistical accuracy. One extreme of this tradeoff does not compress the signals 
at all during measurement; this naive approach has 0 (nd) measurement and communication cost. 

The other natural approach uses a single measurement matrix A £ R dxm at all sensors and obtains the 
observations A T x t . This protocol has 0{nm) measurement cost and either 0(nm) or 0(nmd) communi¬ 
cation cost, depending on whether the shared measurement matrix is synchronized prior to acquisition or 
not. However, Proposition p~0] below, shows that this approach is not consistent for the covariance estima¬ 
tion problem unless m = d, which offers no measurement savings. Thus, our approach offers a favorable 
solution as either measurement or communication cost is significantly reduced over existing approaches. 
Moreover, we precisely quantify the tradeoff between measurement and communication overhead on one 
hand and statistical accuracy on the other hand. 


5 Lower Bounds 


We now turn to establishing lower bounds for the compressive covariance estimation problem. These lower 
bounds show that, modulo logarithmic factors, our estimator is rate-optimal. This means that our estimator 
nearly achieves the best performance one could hope for in terms of the problem parameters n , to, and d. 

The quantity of study in this section is the minimax risk, which is the worst-case error of the best 
estimator. Specifically, it is the infimum, over all measurable estimators E, of the supremum, over all 
covariance matrices E, of the expected error (in infinity or spectral norm) of the estimator when the data is 
generated according to a distribution with covariance E. This is therefore the distributional setting, and we 
will subsequently address the distribution-free setting. Formally, we are interested in lower bounding. 


||E({($ t ,cl> t X ( )} t n =1 )-E|| 


lZ n (Q) = inf sup E 



t EG0 


where the norm || • || is either the infinity or the spectral norm and the class 0 is some subset of the semidef- 
inite cone in d dimensions. Here, the expectation is over the data vectors X- t ,.... X n , which are drawn 
independently and identically from some distribution P s with covariance E, and the projection matrices 
$ 1 ,..., which are drawn uniformly at random from the set of m-dimensional projections in R d (We 
use U to denote this distribution.). The estimator E is parameterized by both the projection operators <1), and 
the observations T,A r ,. We use lZ n , 2 {&) to denote the minimax spectral norm risk and 7L, 1OC (0) for the 
foe version. Note that the high probability bounds in Theorems [2] and [3] translate into upper bounds on this 
minimax risk in both norms for appropriate classes 0. Here we are interested in lower bounds. 

Let ©(foe, ij. d) denote the set of d-dimensional positive semidefinite matrices with foo-norm upper 
bounded by p. Our first theorem in this section lower bounds the minimax error when the data is 
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generated according to a zero-mean Gaussian with covariance matrix £ £ 0(^oo,r?, d), which means that 
we set P s = J\f( 0, £). 

Theorem 7 (1^ Lower Bound). // ^ < 1 and d > 2, then we have, 

oo(©(^oo,t?>d)) > 7 y 15^2 

For the spectral norm lower bound, let 0(^2, V: d) denote the set of d-dimensional positive semidefinite 
matrices with spectral norm at most 77. The following theorem lower bounds the spectral norm error when 
the data is generated according to a zero-mean Gaussian with covariance matrix £ £ ©(£2, t], d). 

Theorem 8 (Spectral Lower Bound). If J y-Pj -f^-j < 1 and d > 6, then we have, 


2(0(^2, *7, d)) > 


77 I d 3 

480 V nm 2 


These two bounds hold in the Gaussian setting and should be compared with the bounds in Corollary [4] 
While the constants and logarithmic factors disagree, we see that the leading terms in the rates match in their 
dependence on n,m, and d. Thus, our estimator achieves the minimax rate, modulo logarithmic factors. 
Note that in the lower bounds, one should set rj = ||£||oo or ||£||2 respectively, so that the bounds also agree 
in the dependence on the signal strength parameters. 

The proofs of these two results are based on a standard information-theoretic approach to establishing 
minimax lower bounds. The idea is to reduce the estimation problem to a hypothesis testing problem among 
many well-separated parameters and lower bound the probability of error in testing. One obtains this lower 
bound by applying Fano’s inequality, which requires upper bounds on the Kullback-Leibler (KL) divergences 
between the distributions induced by the parameters. Our main technical result is a strong data processing 
inequality (Lemma |T8|> that shows that compressing the distributions leads to a significant contraction in 
the KL-divergence. We prove this result by exploiting rotational invariance of the selected distributions and 
properties of Beta random variables. This contraction leads to much sharper lower bounds. 

Note that these results do not immediately imply distribution free lower bounds, as the definition of 
minimax risk includes an expectation over the data. Since our upper bounds hold in a distribution-free 
sense, it is also worth asking if we can establish lower bounds in that setting. This question can be answered 
in the affirmative, essentially be reverse-engineering the translation from Theorems [2] and [3] to Corollary[4] 
Specifically, in the following result, we generate data from Gaussian distributions but then are asked to 
estimate the sample covariance. We argue that if one cannot estimate the population covariance well, for 
which we can apply Theorems [7] and [8] then one cannot hope to estimate the sample covariance either. This 
gives a distribution free lower bound on the compressive covariance estimation problem. 

Our distribution-free lower bounds involve the same sample-dependent quantities that arise in the corre¬ 
sponding upper bounds. To that end, define 0(7700) to be the set of all n-sample data sets X £ W Ur! with 
II-^IIL < 7700• Define 0(77 g x ) in a similar way but with the constraint || 4 Ym =1 Xtxf ||2||X||2 >0 o < V Sl 
and 0(77 s 2 ) with the constraint - \\ x t\\i — Vs 2 - These are closely related to the sample-dependent 

quantities || X H^, Si, and Sn from before. We prove distribution free lower bounds under the assumption 
that the data set X belongs to one of these classes. 
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Theorem 9 (Distribution-Free Lower Bounds). There exists positive constants do, k \, rt 2 , and a function 
no : N —> N such that for each d > do and n > no(d), we have. 


inf sup E$~^ W [||E 
E XG0(77oo) 

inf sup E$ r ^ w [||£ 
e xee( VSl ) 

inf sup E$™^ w [||£ 
E XG0(ris 2 ) 



(7) 

( 8 ) 

(9) 


This theorem is best compared with Theorems [2] and [3] Ignoring logarithmic factors, the leading order 
term in the first bound. Equation 0, is Cl ^rjoo \J ^2 ^ which matches the rate in our distribution-free 
upper bound (Theorem |2|. We prove two distribution-free spectral norm lower bounds to match the two 
The first bound, Equation ([HJ, has leading order term O 


terms in Theorem 


dys-L 

nm 


while the second, 


Equation 


has leading order term O ( \/ dns 2 I. These match the terms in Theorem 


once we replace Sj 


with r/s, and S 2 with rjs 2 . The conditions on d and n enable us to apply Theorems 
have omitted the actual constraints for clarity of presentation. 

Lastly, we consider a different compression scheme where, rather than drawing an independent random 
projection for each data vector, we use the same random projection on every sample. As we have mentioned, 
this approach has been used in several recent papers to estimate structured covariance matrices 012 El 
The following bound shows that this approach is inconsistent for the unstructured setting. Intuitively, the 
challenge is that one simply does not observe d — m directions of the covariance matrix, so one cannot hope 
to estimate the energy in these directions. This intuition is formalized in the following. 


and [8] although we 


Proposition 10 (Shared compression operator lower bound). As long as m < d, 

inf sup E n ||T(n,n£II) - £|| 2 > \ (l - (10) 

T EM, V2 v a 7 

l|E||2<i7 


so that consistent estimation ofY, is impossible with fixed m-dimensional projection operator. 

Notice that in this theorem, the estimator T actually has access to a compressed version of the target 
covariance £. The expectation is over the randomness in the projection, and holds in an asymptotic sense, 
with no dependence on the number of samples n. Consequently, we see that consistent recovery of the 
covariance matrix £ is not possible unless m = d, in which case it is trivial. This shows that this fixed- 
compression sampling scheme is not suitable for the unstructured covariance estimation problem. 

In addition to the fact that this is a population level analysis, the proof also departs from traditional mini¬ 
max lower bound techniques in that we do not use a discretization of the hypothesis space, which in this case 
is the semidefinite cone. Instead, we lower bound the supremum with an expectation over a continuous dis¬ 
tribution and use the geometric structure of this distribution to explicitly lower bound the expectation. More 
specifically, we reduce the compressed covariance estimation problem to a compressed vector estimation 
problem, and we lower bound the expected error for this problem when the vector is distributed uniformly 
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on the unit sphere and the projection is distributed uniformly over the set of m-dimensional projections. To 
our knowledge, this approach to proving minimax lower bounds is novel, and we believe it will be applicable 
in other compressed sensing problems. 

6 Proofs 

In this section we provide proofs of our main theorems and corollaries. We begin by introducing some 
tools that we will use in the proofs, turn next to the upper bounds, and close this section with proofs of the 
lower bounds. To maintain readability, the proofs of many lemmas stated in this section are deferred to the 
appendices. 

6.1 Preliminary Tools 

We will make extensive use of the properties of the Beta distribution so we collect several facts here. A 
random variable u> supported on [0,1] is said to be Beta distributed with shape parameters a, j3 > 0 if it has 

probability density function p(u>) = —— B(a^) -where is the Beta function. 

Fact 11 (Properties of Beta Distribution). The following facts involving random projections and the Beta 
distribution hold: 

1. Let a ~ Xm ■ b ~ Xd-m b e Chi-squared distributed random variables. Then ~ Beta(y, :l f m ). 

2. Let lo ~ Beta(™, d ~,f n ). Then, 


eh= n 

j=i 


m + 2 (j - 1) 
d + 2 (j — 1) 


E[(w — tu 2 )] 


m{d — to) 
d(d + 2) 


3. Ifx £ M. d and $ £ R dxd is a uniformly distributed random rank m orthogonal projection, then, 

$x = ux + \/H-H||x|| Wa, 

where u> ~ Beta a € K d_1 distributed uniformly on the sphere and W £ i s an 

orthonormal basis for the subspace orthogonal to x, i.e. x T W = 0. 


To establish our lower bounds on spectral norm error, we will need a packing of the d-dimensional 
Euclidean sphere in a particular metric. This metric is the spectral norm between the corresponding rank 
one matrices, i.e. d(u,v) = || uu T — vv T \\ 2 , which is a metric provided that u is identified with —u. 
Asymptotic results of this form exist in the approximation theory literature, where one classical result is the 
Chabauty-Shannon-Wyner theorem m. For our purposes the following non-asymptotic statement, which 
is weaker than the results in the literature, will suffice. This lemma is proved in the appendix using the 
probabilistic method. 

Lemma 12 (Packing number lower bound). For any t £ [0.5,1], if di 1 — r 2 ) > 4, then there exists a set of 
unit vectors {vj in of size M > exp(— \/S)T~ d ^ such that \\vivf — v j v J || 2 > t for all i 7^ j. 
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For the subspace learning application we will also use the Davis-Kahan theorem, which is a standard 
result from matrix perturbation theory. The theorem characterizes how an additive perturbation to a matrix 
affects its eigenvectors. 

Theorem 13 (Davis-Kahan Theorem f8j). Let M,A £ Bi nx be symmetric matrices with eigenvectors 
Vi ,..., v n (resp. Ui,... ,u n ) and eigenvalues \i > ... > A„ (resp. Hi > ... > /i n J. Define 8\ = 
minj^j |Aj — Aj|. Then, 


sin Zuj, Ui < 


\\m-a\\ 2 

5 , 


Finally, we will use several standard concentration inequalities for Gaussian random vectors. 

Proposition 14. Let Xi ,..., X n ~ A r (0, E) in W 1 where d > 2. Then there exists a universal constant 
c > 0 such that for any 5 £ (0,1), the following tail bounds on the matrix X £ K nx<i hold, 


p (nxiu < v^HEHooiogfrd/j)) > i - <y 


P (||A'|| 2 ,oo < \J 2 tr(E) log(nd/( 5 )^ > 1-5 



All four of these are standard results. The first two follow from Gaussian tail bounds and a union bound 
over the n vectors. The third result is based on random matrix theory, and shows that the usual sample 
covariance matrix is a good estimator for the population in spectral norm (29). The last bound uses x' 2 tails 
to give foo norm bounds on the error of sample covariance matrix. 


6.2 Proof of Proposition [T] 

It suffices to consider n = 1 as the result will follow by linearity of expectation. Let x = X\ and <1> = T ,. 

We first prove the third claim of Fact[TT| To characterize ||<!>./;|||, by rotational invariance, we can think of 
$ as fixed and x as a uniformly distributed unit vector. By setting $ to project onto the first m standard basis 
elements and using Claim 1 of Fact |TTJ it is easy to see that || < I>at||| ~ w||^||| where ui ~ Beta(^, 

This implies that the angle 0 between x and <l>x is cos _1 (- v /w) or equivalently, that the magnitude of ( \>x 
in the direction of x is ||<l>a^|| 2 cos (6) = w||a;|| 2 . By the Pythagorean Theorem, the magnitude of in the 
orthogonal direction must therefore be ||a;|| 2 -\/w — w 2 , and this direction is chosen uniformly at random, 
subject to being orthogonal to x. This gives the identity = uix + y/oo — w 2 ||x||FFa, which is precisely 
Claim 3 of Fact QT] 

This identity means that. 


&xx T & = uj 2 xx T + (to — uj 2 )\\x\\ 2 Waa T W T + ojs/ co — w 2 ||x|| ( xa T W T + Wax T ) . 

By linearity of expectation we can analyze each term individually. By Fact[TT]and the distribution of a, we 
know that Ecu = I j, Ecu 2 = rn ^+ 2 ) 1 = ® anc * ® aaT : since a is distributed uniformly on 
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the d—1 dimensional sphere. This means that. 


E$a;a; T< h = ~r?~J~^~ xxT + ~JFj —^||x||2E(fRaa T FF T ) + E(w\/w — w 2 )||a;|| (a;E a T W T + WEctx T ) 
did I 2jj did — l - 2j) 


m(m + 2) T 
= —7-—-7 -xx 
d(d + 2) 

m(m + 2) T 
-xx 


m(d — to) „ „n t, 

y ' '\x\\ 2 2 WW t 


d(d + 2){d- 1)' 
m(d — to) . 
d(d + 2)(d- l) 1 


kill 


I- 


xx 


d(d + 2) 

m(md + d — 2) T m(d — to) ||2 


d(d + 2)(d- 1)‘ 


d(d + 2)(d — 1) 


0:115/. 


Note that Iklli = tr(xa; T ). Proposition[l]then follows by linearity of expectation, using the same expansion 
for all of the n samples, and rescaling by d 2 /to 2 . 


6.3 Upper Bounds 

Recall that Ei = 1 ^t x txf is the observed covariance. Define, 

y = d{dm + d — 2) d{d - m) 

m(d+2)(d-l) m(d + 2)(d-l) 1 J ’ 

which is the expectation of Ei from Proposition [l] The proofs of both infinity and spectral norm bounds 
follow by arguing that Ei is close to E and then using this fact to relate our estimator E to the estimand of 
interest E. 

foe-norm Bound: The main ingredient of the bound is an intermediary deviation bound on quadratic 
forms. 

Lemma 15 (Quadratic-Form Deviation Bound). Let d > 2. For any unit vector u £ W l , define b = 
max te[n] \ X J U \ an d c = max t6 [ n ] \/||tCt ||2 — (xf u) 2 . For any S £ (0,1) with 5 < n/e and log(l/5) < 

fl (d-i)^(d+i) ’ w dh probability at least 1 — 45 we have, 


’(Ei-E)t 


< 


d 2 log(2/<5) 8c 2 d 2 log(2/S) 


\/l81og(n/(5) ( b 


16 c 2 


d 2 log(2/ 6) 


b + 2 


c 2 log(n/i5) 


Proof. We give an abbreviated proof here in the interest of readability, a fully detailed proof is given in the 
appendix. The proof involves a careful concentration of measure argument and is the crux of our analysis 
for the £oo norm bound. First we use the distributional characterization of the projection operator to expand 
the expression it T Ei u in terms of several Beta random variables. Then, we perform a two-step analysis; we 
first control the randomness in one set of these random variables, leaving dependence on the other set. This 
gives a large-deviation bound involving the remaining random variables in several places. We next control 
all of these terms, which involves several more deviation bounds. 
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We seek a concentration bound of -u T S-| u around u T T,u. First, we express u T T,u as an average. Let 
b t = \xju\ and let c t = -\/||a: t || 2 — ( xfu ) 2 . Then, 

T - = d{dm + d- 2) T d(d - m) = I V d{m + 2) h 2 + ~ m ) r 2 

m(d + 2)(d — 1) m(d + 2)(d — 1) + 4 m(d + 2)(d — 1) 4 

We now expand the term involving Si. Recall that Si = Y^it=i ^t.Xtxf $t, so u t Y,iu = y-y J2t=i( x J^t u ) 2 - 

Now consider the term (xf$ t u) 2 . Write = w t u+ \/wt — w 2 Wa t where w t ~ Beta(y, yy),a t € 

K d_1 is distributed uniformly on the unit sphere, and i f’ is an orthonormal basis for the subspace orthogonal 
to u. Then, 


(xj^ t u) 2 = (uj t xju+ \JU t — (jJ 2 xJWdi 


Since at is distributed uniformly and independently of the other random variables and since \\xfW\\ = c t , 
we have that xfWa t =c t at^/vt, where a t is a Rademacher random variable (i.e., 1 or —1 with equal 
probability), and v t ~ Beta (y, yy). (Here, v t is the distribution of the squared length of the projection 
of a’jTffy ||a;^fF|| on a t , and oy captures the symmetry of the distribution of the random inner product. 
Specifically, v t = 1 if d = 2.) Thus, combining with the above, we have that. 


u t SiM 


d 2 

nm 2 


YjixjQtU) 2 

t= 1 


d d 2 
nm 2 


n ( 

y, f u t b t + 

t =i ^ 


- U 2 C t a t y/vt\ 


2 


d 2 

nm 2 


Bit + B2t + B 3t , 

t —l 


where £ lt = wfb?, S 2 t = (wt - w?)c 2 t't, and B 3t = 2a t b t CtOJt\/oJt - 

At this stage, the main technical difficulty of obtaining a bound on |u T S 1 « — « T S«| stems from the 
highly concentrated distribution of v t . Simply applying an ordinary Bernstein bound gives a result that is 
loose by a factor of up to d, which is due to the fact that, while the distribution of v t is highly concentrated, 
its range is by comparison large - namely v t has non-zero density on the entire interval [0,1], 

To circumvent this, we obtain a bound in two stages. By the triangle inequality. 


Silt — u t Tju\ < 


u T l±iu - E w „ M T S i u 


E„ 


w T S t w — ir Eu 


where E WtjCTt denotes expectation with respect to each ui f . at random variable, conditioned on a fixed value 
for vt.. For the first term, we fix all v t variables, and obtain a concentration bound on the first term above 
with respect to the a t and u) t variables. Namely, we obtain a bound of the form. 


vr Si u — E w a w T E t u 


>g(5,{v t }^i) <5, 


for all 5 > 0, and for a certain function g. This is done using an application of Bernstein’s inequal¬ 
ity and involves calculating the variances and ranges of B lt , /i 2f , and B 3t with respect to a t and oj t . 
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Then, we use specialized concentration bounds for the sum of Beta random variables to obtain bounds 


E„. 


'ti T £ t rt — u T T,u 


, and yet another tail bound for Beta random variables to control g. 


Consider all of the v t random variables bxed. Let V\ f . V 2t , and V 3t be the variances of f? lt , /i 2f , and B 3t 
respectively, with respect to a t and w*. which can be calculated using Fact[TT] We have. 


Var (Bu + B 2t + B 3t ) < 3(Vi* + V 2 t + V 3 1). 

As for the range, by straightforward calculation, we have, 


I5i 


B 2 


B 3 t\ < ^2 + 4 c t v t + T^btCty/vt ) • 


; ) 


Bernstein’s inequality now reveals that for the function g debited above we can set. 


9 & {*}?=!) = 


6 log(2/<5) 


n 


« - TXVu + V2t + Vat) 


2d 2 log(2/<5) 


max bf + -c 2 t v t + ^b t c ty /vt . 


3nm 2 te[n] 

Note that for d = 2, this completes the proof, since then v t are identically 1. The remainder of the proof is 
for d > 3. 

Next, using specialized concentration inequalities for sums of Beta random variables (see appendix), we 
obtain the following: for any 6 , Si > 0, with log(l/5i) < (jzyyrp+iy> 


E Wtj cr t U T YitU — U T YiU 


< 


d(d — m) / /24(c? — 2)c 4 log(l/(5i) /6c 4 log(l/J) \ 
m(d+ 2) l y n(d — l) 2 (d + 1) V (d, 2 — l)n ) 


with probability at least 1 — 6 — £1. 

Finally, it remains to bound g. It can be calculated that. 


V lt = 


d%i ( m(m + 2) (to + 4)(m + 6) f m(m + 2) 


to 1 


d(d + 2 )(d + 4 )(d + 6) 


d(d + 2) 


d A cf 


V 2t = 


m(m I I ( m + 4 )( r ™+ 6 ) 0 (^+4) \ 

1+ (d+4)(d+6) J. fd+4) ) 

( m(d—m) ^ ^ 

d(d+ 2) 

V d(d.+ 2) J 


V 3t = v t 4 


d i bt c t\ Tn(m + 2)(to + 4) 
m 4 ) d(d + 2){d + 4) 

and so to bound g it suffices to use the fact that, for any S > 0, 


1 - 


to + 6 

d + 6 


P (maxi/ f > ^log(n/(5) ) < S, 
\te[n] d ) 


(see appendix). This last bound allows us to control the V 2t , V 3t terms, and the rest of the terms in g. 


To complete the proof, we combine the resulting bound for g with the one on 
This is done in detail in the appendix. 


Ecj t .(T t u T Y, t u - u t Y,u 


□ 
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We are now able to prove Theorem^ Taking a union bound over all vectors u = ' ‘^' 2 ' for i,j£ [cZ] we 
obtain a bound on ||Ei — E||oo- Specifically, with probability 1 — 6 we have. 


|£l - £||oo < Kl 




~r ~\ -r I + ^2 


d 2 log (d/5) 


b + 


c 2 log(nd/i5) 


for constants ki, k 2 > 0, with b = HXH^ and c = ||X||2,oo- 

Using the relationship between E and Ei (Equation ([T}) and the equivalent relationship between E and 
E (Equation Q), we then have. 


IIS 


S||oo ^ 


m(d + 2 )(d 1) II 

d(dm + d — 2) 1 


£||oo 


m(d-m) ^ 
dm. + d — 2 1 


Slloo < 21| 


S|| 


OO 5 


which holds provided that d > 2. This, combined with the fact that ||X 112,00 < Vd\\X\U proves the 
theorem. 

Spectral-norm Bound: This proof is an application of a particular version of the Matrix Bernstein inequality 
that applies to unbounded random variables that exhibit sub-exponential tail decay. The result is due to De 
La Pena and Gine (Lemma 4.1.9 in 0) but we use a version available in Tropp’s monograph (Theorem 6.2 
in 1261 1. For completeness, we reproduce the result as Theorem |22| in the appendix. 

We first decompose the unnormalized sum E j. Using Fact 11 we have, 

^ E (u t Xt + \Jut- io^\\x t \\ 2 W t a^j x (u t Xt + \Ju t - u%\\xt\\2W t at 

= ~E (^t x txj + (w t - ujt)\\xt\\lw t a t ajw^ + u t y/u t - uj?\\x t \\ 2 {x t o^Wj + W t a t xf)j 

n 

= ~E (Xi,t + Y 2t + Ys t) > 

where we have defined. 


Yi it = uifxtxj 

Y 2 ,t = {u t - u 2 )\\x t \\\W t a t c^Wj 

Y 3}t = wty/ut - ||ar t || 2 ( x t aJWJ + W t a t xJ) . 

By linearity of expectation and the triangle inequality, 

II- £e||2 < E -Ey M || 2 . 

fc=i t=i 


The result follows from high probability bounds on the three terms on the right hand side, coupled with a 
union bound. 

To apply the Subexponential Matrix Bernstein inequality, we need to control all moments of these ran¬ 
dom matrices by particular functions of their variance. This is the content of Lemma 16 where we show 


19 

















that for each k £ [3], t £ [n] and any integer p > 2, 


e(y m - EY^r d 


(id 


for particular values R ^ and matrices A k t . The value Ilf. is independent of t and serves as range-like term 
in the deviation bound while the A\ t matrix appears in the bound through of. = 1 A'f. t and acts 

as the variance term. 

Lemma 16. Let S-\ , t = an ‘^ <?2,t = ||xt|| 2 f. F° r integers p > 2, k £ [3], and t £ [n\, 

Equation [TTj/io/cA with: 


Al t = 1680 — S ht , 


d 4 

,t = 48 —pS2,t, 


Ai f = 15 


m 


3 ’ 4 ^ d 3 \d 

Proof. See Appendix |B.3| 

We now use the definitions, 

1 


S2,t + S ltt , 


5i = 




E M&txJ 


Ri = 28-\\x\\i 


R 2 = 12-\\X\\l oc 
R 3 = 10^||A1I jOO . 


S 2 = 


-t n i n 

-Y. s » = 

71 < ^ 71 Z ' 


\x t \\ 2 - 


□ 


With these, the Subexponential Bernstein inequality applied to all three terms reveals that with probability 
at least 1 — 3c), 


ITl 2 .. - 1 

— Ei - E 2 < - 

d- n 


/- 

E V 2a i I °s( d / S ) + 2R i \°&( d / S ) 


d 3 




< 48t —5! + 13t —5 2 J - log (d/6) + 


d 3 


100?n||X ||r 

nd 


log (d/6). 


Adjusting for the normalization gives. 


Si - S|| 2 < 48 a / —Si + 13a/ ~^S 2 J - log (d/6) + 


100d||X||| jC 


log (d/d). 


Finally, the bound ||E — E|| 2 < 2||Ei — E|| 2 proves the theorem. 

Proof of Corollary[4j The corollary follows from Theorems [2] and [3] along with the Gaussian tail bounds in 
Proposition [74] Equipped with these bounds, we can plug in for | A - 1 j ^ and ||Af || 2i00 in Theorems [2] and [5] 
Then by the triangle inequality, we can bound ||E — E|| < ||E — 1 X t Xj || + ||S X^t=t XtXf — E|| 
and use the latter two bounds to complete the proof. 
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Proof of Corollary [5} The proof is based on a lemma of Achlioptas and McSherry 0] that controls the 
spectral norm of the matrix E/,. in terms of only the k dominant noise directions. Specializing their lemma 
to our setting shows that, 

||Efc-E|| 2 <2||E-E|| 2j 

which follows since E is rank k. To conclude the proof, we must control the quantities Si and S 2 in 
Theorem|3] Note that ,S'i < ||X||i )00 ||E|| 2 while 5 2 < ||X||f i00 . Proposition[l4]shows that with probability 
at least 1 — <5, we have the bound, 

11*111,00 < 2 tr(E) log(nd/<5) < 2fc||E|| log(nd/<S). 

Using this expression in the upper bounds for both 5i and S 2 and applying Theorem [3] produces the error 
bound. 

Proof of Corollary [6} Corollary [6] follows immediately from the spectral norm bound in Corollary [4] and 
the Davis-Kahan Theorem ('Theorem [T3]) which introduces the eigengap 7/.. 


6.4 Lower Bounds 

Our lower bounds employ a well-known argument based on Fano’s inequality. In particular, we will use the 
following result (See ll27t ). 

Theorem 17. Assume that M > 2 and suppose that a parameter space 0 contains elements 9 0 ,9 1 ,, 9 m 
associated with probability measures Po,..., Pm such that: 

1. d(9i, 9j) >2 s > 0 for all 0 < j < k < M. 

2. P j is absolutely continuous with respect to Pofor all j £ [M] and, 

1 M 

J]J2 KL ( P P F °) - al °S M > 
i=1 


with 0 < a < 1/8. 


Then, 


inf sup Pg 
e eee 


did, 9) > s 


> 


s/M 
1 + y/M 


2a 



To apply the theorem we need to control the Kullback-Leibler divergence between these distributions. 
The following lemma enables a sharp KL-divergence bound. 

Lemma 18 (KL-divergence bound). Let Pq be a distribution on (z. U) where U is an orthonormal basis 
for a uniform-at-random m-dimensional subspace, x ~ Af(0, pi) and z = U T x. Let Pi be the same 
distribution but where x ~ A/"(0, pi + 7 vv T ) for any unit vector v and any 7 £ R such that 7 > —p. Then, 

KL{Pf\\PS)<\^^. ( 12 ) 
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The lemma demonstrates that our compression model results in a contraction in the Kullback-Leibler 

2 

divergence between Gaussian distributions. As the KL-divergence between the two Gaussians is ©(^j-ro), 
this contraction is by a multiplicative factor of m 2 /d 2 . This is known in the literature as a strong data- 
processing inequality El HU EH and it allows us to easily adapt existing lower bound constructions to our 
setting. 


Proof of Theorem[7| The goal of this proof is to apply Theorem [T7] for a set of d + I distributions. The first 
distribution Pq has the data vectors drawn from A’(0. pi), while the distribution Pj has the vectors drawn 


from pi — pejej). Clearly the first condition of Theorem 17 is satisfied with s = 7/2. Secondly, 


the infinity norm bound on all covariance matrices is 77 and to ensure positive semi-definiteness we require 
7 < p. Lastly, by Lemma 18 we have. 


1 

-'£ / KL(P j \\P 0 )< 


3 7 2 nm 2 
2 p 2 d 2 


which means that we can set 7 = 77 


2a d 2 log d 


. The PSD constraint means that we require. 


2a d 2 log d 


< 1, 


which happens for n large. Theorem 17 now states that. 


inf supPs 
£ e 


|S - E||oo > 77 


2 a d 2 log d 


> 


Vd 

1 + \fd 


l — 2a — 


2a 


log d J 


:v 


We set a = 1/10 and provided that d > 2, the bound becomes, 

inf sup Pe 


|E - S||oo > r] 


1 dr log d 


15 nm 2 


1 

> -. 
“ 7 


Theorem[7]follows now by application of Markov’s inequality. 

Proof of Theorem [ 8 | As before, the proof is based on an application of Theorem [T7] but we will use 
exponentially many distributions. The first distribution I\ t has the data vectors drawn from 7V(0 ,77 1). For the 


remaining distributions, let {vj}jL 1 be the 1/2-packing in the projection metric guaranteed by Lemma 12 
We know that M > expf— 1 /8)2 d - /H . For each j, let Pj be the distribution where the data vectors are drawn 
from Af( 0, pi — 7 VjvJ). 

The first condition of Theorem [F7| is satisfied with s = 7/2, all covariance matrices have spectral norm 
at most 77 , and we require 7 < 77 . Lastly, by Lemma 
Plugging in for M, we require. 


18 


we have that the average KL is at most | ^ rj ypr ■ 


3 7 2 nm 2 (d 1 

2 T? 2 rr ~ a \ 8 ° S 

which is satisfied if we set y 2 = °^ R provided that d > 4. Setting a = 1/24 and provided d > 6 , 


the right hand side of Theorem 


17 


is lower bounded by 5/24 while the separation is 2 \/tt§ 2 ~jpr- The 
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condition involving n, to, and d is based on requiring that 7 < ?/ for positive-semidefinite-ness, and we may 
apply the Lemma 


12 


since d > 6. The constant ^ 


is a lower bound on ^ x ^ 


1152 


. Theorem 8 follows 


by application of Markov’s inequality. 


Proof of Theorem [9} The goal of this proof is to apply the lower bounds in Theorem [7] and [8] but we will 
have to show that introducing the distribution and the expectation does not significantly affect the lower 
bound. This is done by considering the distribution free problem where the data is generated according to 
a particular Gaussian, but the goal is to estimate the sample covariance. Our existing lower bounds show 
that one cannot estimate the population covariance well, while concentration-of-measure arguments show 
that the sample covariance is very close to the population version. Combining these two arguments with 
the triangle inequality gives a lower bound on estimating the sample covariance, which is a distribution free 
problem. 

To simplify the notation, we will suppress dependence on the projection operators and the sample itself 
so we can simply write E for the estimator and E for the sample covariance. In other words we replace the 
supremum over samples x” with a supremum over positive semidefinite matrices E. We must also make 
sure that a:" meets the norm constraints, which for now we will denote by the term “valid.” Letting || ■ || 
denote the particular norm of interest, i.e. spectral or we have. 


inf supP[||E - E|| > ei — e 2 ] 
s e 


> inf sup 

E E 

> inf sup Pe~p s 

E E 


||S-E|| > Cl -e 2 
[||E - S|| > ei - e 2 ] 


P| x" valid 

- Pe~p s \ x i invalid]. 


In the first line, the only randomness is in the projection operators, while in the second line we have lower 
bounded the supremum over sample covariances by a supremum over population covariances and an ex¬ 
pectation over sample covariances. The distribution E ~ Py, is the distribution over E = ^ x t x J 

where 27 * ~ d ' A r ( 0 . E). The second inequality is based on De Morgan’s identity and a union bound, and 
the consequence is we have separated the norm constraints of the sample from the estimation problem alto¬ 
gether. To bound the probability of the first event, note that by the triangle inequality, if ||E — E|| > e\ and 
||E — E|| 2 < e 2 , then ||E — E|| > ei — e 2 . This means, 

^E~P S [||^ — ^11 > e l ~ £ 2 ] 

>PE^p E [||S-S||> ei p||E-E||< e2 ] 

>Pe~pJ|S-£|| > ei ]-P^ P J|E-E|| >e 2 ]. 

The second inequality follows from De Morgan’s identity and a union bound. Note that the first term is 
the error in estimating the population covariance given compressed Gaussian samples, so we can apply our 
lower bounds from before. The second term does not depend on the estimator or the projections, and it can 
be controlled by standard concentration-of-measure arguments. Putting the terms together gives, 

inf supP[||E — E|| > ei — e 2 ] > 

s E 

inf su P P[||E - E|| > ei] - P[||E - E|| > e 2 ] - P[a? invalid], 

E E 


23 






We now derive the i^ norm bound. By examining the proof of Theorem [2] we know that with t\ = 


1 d 2 log(rf) 
15 nm 2 


the first term can be lower bounded by 4 provided that E is allowed to have (r X _ norm as large 


14 


and set €2 = 77 


l°g(2 d/5) 


SO 


as 77 . For the second term, we can apply the fourth inequality in Proposition 
that this probability is at most 5. Finally the third term constrains our setting of 77 , the infinity norm bound 
of the population covariance. The sample is valid if ||X< 77 ^ and the first inequality in Proposition [l4| 
shows that if, 


V < 


2\og(nd/S)' 


then the sample will be invalid with probability at most 5. Setting 5 = and r\ to meet the inequality 
reveals that with probability at least /., there is a constant K \ > 0 such that. 


IS-SIloo > 


«i77qq 

log (nd) 


Id 2 log(d) 


nm* 


iog(<*A 


This bound holds in a minimax sense and the first claim in Theorem [9] follows from Markov’s inequality. 
This bound requires the conditions on n and <1 from Theorem[7] 


The two spectral norm bounds follow in a similar manner. We can set y ^2 so that the first 

term is lower bounded by provided that E is allowed to have spectral norm as large as 77 in the construc- 


14 


with e 2 = V\/ clog n /S ^ 


the 


tion. The second term can be bounded by the third inequality in Proposition 
probability is at most S. 

For Equation (|8j, the sample is valid if ||E|| 2 ||X||| ^ < rjs 1 By applying the second inequality of 
Proposition[l4]the left hand side here is bounded by. 


\\X\ 


2,00 


<2d||E|| 2 log(nd/<S)(||E-E|| 2 


<2dm\llog(nd/S) (l + ^/ cl ° g W 5 A . 


Again setting 5 = 1/12 and if n is large enough, we may set. 


77 = c. 


VSi 


d\og(nd) ’ 

This choice implies that there is some constant k 2 such that with probability at least 1/24, we have. 


— S || 2 > k 2 


VSx 


d\og(nd) \ V nm 


d 3 


where we used the fact that d 2 /m 2 > d/m since m < d. 

For Equation (|9]), the sample is valid if / ^// =1 ||xt||| < r]s 2 - Again, the second inequality in Proposi¬ 
tion 14 gives. 


1 " 

- < max ||xtH 2 < (2 tr(E) \og(nd/5)) 2 < 4<i 2 ||5:||| \og 2 {nd/5). 

n te[n] 
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Setting S = 1/12, we want this to be at most r]s 2 which means we may set 77 to be. 


Vvsj 

T] = C --. 

d\og(nd) 

As in the previous case, this implies the lower bound. 


IIS 


S ||2 > K2 


d\og(nd) 




This holds for some K 2 > 0 with probability at least -pj . 

Proof of Proposition [TO) To prove Proposition [lOj we need one intermediate result. The following lemma 
lower bounds the minimax risk (in squared £2 norm) of estimating a vector when it is observed via a low¬ 
dimensional random projection. 

Lemma 19 (Lower bound for compressed vector estimation). Let U ( v) be the uniform distribution over 
vectors with norm v in and let V be the uniform distribution over m-dimensional projection matrices 
over R d . Then, 


inf sup En~p||T(n,IIx) - x\\% > inf ^En~p||T(II, Hr) - x\\l > u 2 (l - ^). 

T *:||®|| 2 =77 T d 

Proof. See Appendix |B.5| □ 

For the theorem, we lower bound the minimax risk by, 

inf sup En~-p||T(n,nEII) - S|| 2 > inf E a .^ w( /^ ) E n ~'p||T(n,na;a; T n) - xx r || 2 

T E,||S|| 2 <77 t 

> inf E,^ u(v ^ ) 77sin ^ (u max (T(n,nxx r n)), -), 

T n~v \ IFII2/ 


13 


(M) is the eigenvector corresponding to the largest eigenvalue of M. Here we are applying 
and using the eigengap for the matrix xx T , which is 77 . We can now eliminate the dependence 


where v n 
Theorem 

on T and instead take inhmum over estimators v (II. 11:/;) for the vector x. We may replace 11 ./;:/; 7 ! I with I lx 
since they contain precisely the same information. 

For unit normed vectors x and v, some geometry reveals. 


. . I\\x — ull 

sin Z(v,x) > y---. 

This applies since both vectors are eigenvectors, and hence normalized. The minimax risk is thus lower 
bounded by. 


inf 

v,IM|=i 


E x ~ M (i)T7 

n~-p 



IF 


7/(n,nx)||. 


Proposition [TO] now follows by applying Lemma [l9[ Notice that we apply Lemma 19 with v = 1 since the 
application of the Davis-Kahan theorem already accounts for normalization. 
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7 Conclusions 


In this paper, we studied the problem of estimating a covariance matrix from highly compressive measure¬ 
ments. We proposed an estimator based on projecting the observations back into the high-dimensional space, 
and we bounded the infinity- and spectral-norm error of this estimator. We complemented this analysis with 
minimax lower bounds for this problem, showing that our estimator is rate-optimal. We showed that this es¬ 
timator also adapts to low rank structure in the target covariance, and we mentioned applications to subspace 
learning and to learning in distributed sensor networks. Note that many other consequences, for example 
to the task of learning the structure of a Gaussian graphical model are immediate following the results of 
Ravikumar et al. |23l . 

The main insight of our work is that by leveraging independent random projection operators for each data 
point, we can build consistent covariance estimators from compressive measurements even in an unstructured 
setting. However, due to the absence of structure in this problem, the effective sample size shrinks from n in 
the classical setting to nm 2 /d 2 . This gives a precise characterization of the effects of data compression in 
the covariance estimation problem. 
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A Deviation Bounds 


Theorem 20 (Bernstein Inequality). If U\...., U n are independent zero-mean random variables with \U\t < 
B, a.s., and — Var((/j) < er 2 , then for any 5 £ (0,1): 


P 



< 


2ct 2 log(2/<5) 


log(2/(5) \ > 
2>n ) ~ 


Theorem 21 (Matrix Bernstein Inequality li26l ). Let Xj ,..., X n be independent, random, self-adjoint ma¬ 
trices with dimension d satisfying: 


EA'fc = 0 and ll^felb < R almost surely. 


Then, for all t > 0, 


k =1 



< dexp 



-t 2 /2 \ 
+ Rt /3 J 


where a 2 




fc=i 


Theorem 22 (Subexponential Matrix Bernstein Inequality ll26l l. Let A ),.... X n be independent, random, 
mean zero, symmetric matrices with dimension d. Assume that there exists R £ K. and A & £ M. dxd , k £ [n] 
such that: 


E(X£) d ^R p ~ 2 A 2 


forp = 2,3,4,... 


Then for all t > 0, 


k =1 


> t < dexp 


-t 2 /2 
a 2 + Rt 


where cr 2 



Theorem 23 (QjD). Let A'i, ..., X n be independent random variables with EA' 2 < oo and X t > 0, a.s., 
for all t £ [n], Then for any e > 0: 


F ( E (£ x, )'§H Sexp teiW) 

Theorem 24 (8281). Let A 1; ..., X n be independent random variables with EA t = 0 and EA 2 = b t and: 

E|A t | fe < |fc!a fe - 2 , 

for all t £ [n], k > 3 and for some constant a > 0. Then: 


Eexp 



< exp 


f^Uh\ 

\ 2(1 - 8 ) ) 


for any s £ (0,1) and A > 0 provided that A a < s. 
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Proposition 25. For any t £ [n] and d > 3, let ct > 0 ant/ z^ t ~ Beta (y yy). Define c = max tg [ ra ] c* and: 

2{d — 2 )nc 4 


B = 


(d-l) 2 (d+l) 


For any s £ (0,1) and 5 > 0, z/log(l/<5) < > then: 


y t=l £=1 


l 2Blog(l/5) \ 
n 2 (l — s) j ~ 


Proof. For each t £ [n], we have c 2 z 2 t > 0, Ec 2 ^ = yyc 2 , and E(c(t , t) 2 = cf d2 3 _ 1 < . So by 

Theorem 23 for any e > 0, we have: 


i n i i n \ 

- Vi — 7 c t -J2 c t Ut - ~) - exp 

77 ^—* (2 — n z ^ rt / 

y £ —1 


and the result follows by inverting the inequality. 




n 


Proposition 26. For any t £ [n\ and d > 3, /ef c t > 0 anal Z 2 t ~ Beta( 1 yy). Define c = max (e u c t . 
Then for any 5 > 0: 

P^V 1 c 2 1 Vc 2 r - /6c 4 log(l/5)\ 

"hi n ( d2 -v ) 


Proof For t £ [n], define A7 = c 2 — yy ^ and note that EA7 = 0. Let bt = EA' 2 = c 
Then for any k > 3 we have, using Minkowski’s Inequality: 


2 _ „4 2) 

‘ (d-l) 2 (d+l)' 


E|A t | fc = cf E 


vt - 


d- 1 


< cf (W Y' k + 


= c; 


,2k 


'* -1 1/2 + r 


n 


+ 


_ n (d—\)/2 + rj d—1 


\ r =0 


We now leverage two claims: 

Claim 1. For k > 3: 


max 


1/2 + k 1 M 10 

x 


< 


fd— l)/2 + k k + l'd— 2(d + 3)y d + 5 
Proof of claim. The proof is elementary. 
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Claim 2. For any £ > and k > 3: 


d- 1 60 


fc-i 

n 


1/2+ r 


d — 2 d + 3 11 (d — l)/2 + r 

r =3 


< fc!C 


fc -2 


Proof of claim. We proceed by induction. For k - 3, the expression simplifies to the second term in the 
previous claim: 


d - 1 60 


< 6C 


d — 2 d + 3 

For the inductive step, assume the claim holds for fc > 3, then for k + 1 we have: 

fc-1 


d-1 60 tt V 2 + r < 1/2 + fc fclC fc ~ 2 < (fc + 1 yc*- 1 

d-2d + 3l = l (d-l)/2 + r - (d-l)/2 + Jfe ^ M 


where the first step is the inductive hypothesis and the second is from the first part of the previous claim. 


4 d 


Armed with these two claims we can proceed with the proof of the proposition. For any a > we 
have: 


fk—l . \ 1 /^ 

E|X,| fc <cf I ( []^/ 2 + r ' 


\ r=0 
fc—1 


(d - l)/2 + r 


d-1 


< 2 k c 2 t k n 


1/2 + i 


r=0 


(d - l)/2 + r 


= 2 c: 


k J2k 


15 


fc —i 

n 


1/2 


(d-l)(d + l)(d + 3) Al(d-l)/2 + r 


< 


< 


(d-l)(d+l) 

r 4 

c t _ 

(d-l)(d+l) 

r 4 

c t 


(2c/) 


2\k—2 


60 T~r 1/2 + r 

d + 3 M ( d ~ l )/ 2 + r 


(2c?) fc -2^—?Jfe! 


fc -2 


d — 1 V d + 5 


d 2 ,, k _ 2 _ b t k _ 2 


. . , , -{2c 2 t ) k - 2 f—^k\a*-' 4 = ^/c!a fe 

(d — l)(d + 1) v d-1 2 


Here the first inequality is from the application of Minkowski’s inequality above, the second uses the fact 


1 


that ——- < f or an y r. in the third line we use pull out terms from the product, using the notation 

2 

that n; =3 (-) = 1- Then we apply the claim from before with ( = ——- and finally substitute in for a and 


b t . 


d + 5 


Now setting a = we have that the moment bound above holds for all X t . Let s £ (0,1) and A > 0 


such that A a < s. Since B = ^ t, by Theorem 


24 


we have: 


Eexp ( A ^Z x t ) < exp 


t= 1 


B\ 2 


2(1 -a) 
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We may now apply the Chernoff trick, so that for any e > 0: 


X t > e ) < exp ( —Ae 


B A 2 
2(1 - s) 


= exp ( A 


B A 

2(1 -a) 


1 — s 1 — s s 

Set A = ——— e, so that if ——— e < we have: 

B B a 


^2 X t > e ) < exp 


-e 2 (l~ g ) 
2 B 


Inverting this inequality and the condition above proves the result. In particular, we require that for the S > 0 


that we choose. 


B 


1 — s 2Blog(l/S) s 


< —. If this is the case, we have: 
1 — s a 


5>*V-1 


2B log(l/S) \ 


< <5 


— s 


□ 


Proposition 27. Ifv ~ Befa( 1, c ^ L ),ford > 4 f/;en P(z^ > log(l/<5)) < 5 for any 6 > 0. 

Proof. Let a G R d_1 be uniformly distributed on the unit sphere and let £ ~ Beta(l, Then a(l) 2 = v 
and a(l) 2 + a(2) 2 = (. So for any e G (0,1): 

P(j/ < e) = P(a(l) 2 < e) > P(ct(l) 2 + cr(2) 2 < e) = P(C < e) = P(— log(l — () < — log(l — e)). 

It is well known that — log(l — () is exponentially distributed with rate rl 2 :i , and so: 

P(- log(l - C) < - log(l - e)) > P(- log(l - C) < e) = 1 - exp 

□ 


B Proofs of Technical Lemmas 


B. 1 Proof of Lemma [12] 


The proof is based on the probabilistic method. We will first show that for any fixed unit vector x , if we 
draw another vector v uniformly at random, then: 


P[||a:a" T — vv T \\2 < r] < exp 


^ -dlog(l/r) + l ^ 


Armed with this deviation bound, if we draw M points uniformly at random from the unit sphere in d 
dimensions, then the probability that no two points are within r of each other is (via a union bound): 


m j ■ i \vivf 


Vjvjh > r] = 1 -P 


IJ II ViV? - vjvJW < T 


(M\ (-d\og{l/T) + 1 

UM— 4 — 
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As long as this probability is non-zero, then we know that there exists such a packing set. In particular, if: 

M(M — 1) / — dlog(l/r) + 1 

- 2 - eXP (- 4 - 

then we would show existence of a packing set of size M. This inequality is satisfied whenM > exp(—l/8)r 
To prove the deviation bound above, note that since x, v are unit vectors, the spectral norm difference is 
just the sine of the principal angle between their subspaces, which is just the sine of the angle between the 
two vectors. It is well known that (see (71): 

P [{x T v) 2 > p/d\ < exp Q(1 - p + log/3)^ 

P [cos 2 (Z(:c, i>)) > (1 +e)/d\ < exp Q(-e + log(l + e))^ 



Therefore: 


sinZ(s, v) < y/l — (1 + e)/d = P [cos 2 Z(x, i>) > (1 + e)/d] < exp [ -(—e + log(l + e)) 


If e > 3, then e — log(l + e) > e/2, so we can upper bound the probability by exp (—3e/4). Setting 
e = d{\ — t 2 ) — 1 gives the inequality: 


sinZ(x, v) < >/1 — (1 + e)/d 


< exp 


—d(l — t 2 ) 2 + 1 


We now proceed to lower bound (1 — r 2 ) by log(l/r). This is possible for r £ [0.5,1] as both functions 
are monotonically decreasing in r but (1 — r 2 ) is concave while log(l/r) is convex. At r = 1/2 the first is 
larger than the second, and they are both equal at r = 1. The condition on r and this lower bound establishes 
the inequality used above. 


B.2 Detailed proof of Lemma [15] 

Here we provide a more detailed proof of the quadratic-form deviation bound. Recall the claim to be proven: 
for d > 2, for any unit vector u £ and for any 5 £ (0,1) with <5 < n/e and log(l/<5) < 1 

with probability > 1 — 46 we have 


’(St-S)t 


< 


d 2 log(2/<5) 8c 2 

2 J 


nm* 


d 2 log(2 /S) 


nm* 


\/l81og(n/(5) ( b‘ 


16 c 2 


d 2 log(2/S) 


6 + 2 


c 2 log(n/i5) 


where 6 = max te[n] \xju\ and c = max te[n] Vll^lll “ i x I u ) 2 - 

The proof involves a careful concentration of measure argument and is the crux of our analysis for the 
e x norm bound. First we use the distributional characterization of the projection operator to expand the 


d/8 
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expression u T Y>\u in terms of several Beta random variables. Then we perform a two-step analysis; we first 
control the randomness in one set of these random variables, leaving dependence on the other set. This gives 
a large-deviation bound involving the remaining random variables in several places. We next control all of 
these terms, which involves several more deviation bounds. 

Let b t = \xfu\ and let c t = \/||£t || 2 — (xju) 2 . Then: 


Eit = 


d(dm + d- 2 ) rp _ 

m{d + 2){d — 1 ) m{d 


1 d(m - 

n m(d - 

1 d(m - 


-^(xju ) 2 + —— 
2) m(d 


d(d — m) 

2 )(d-l) 

d(d — m) 

2)(rf- 1) 


tr(S) 




n 


2) 6 2 

^ m(d + 2 ) 4 


d{d — in) 
m{d + 2 )(d — 1)* 


We now expand the term involving Si. Write u = totU+^/uit — uj 2 W at where u) t ~ Beta(j=f, at £ 

M d_1 is distributed uniformly on the unit sphere, and W is an orthonormal basis for the subspace orthogonal 
to u. Then: 


j 2 n / ,_ \ 2 

U T S 1U = — V ( LOtxJu + Juj t -Ojfxf Wat ) 
nm z f V v / 

= -o XI ( + ( w t - ix 2 ){xfWa t ) 2 + 2u t xJuJu t ~ u 2 xJWa t 

n.mr z ' \ v 


d d 2 


nm* 


nm* 


X + 2atu t xJusj- u%\\xfW\\2y/vt 


X ( w t^ 2 + ( w t ~ UJ ‘t) ( n v t + 2cr t 6tC t w t Y / wXw 2 v / ^J 

t=i ' ' 


— — V Sit + s 2 t + s 3i 

n ' 


Here z/ ( ~ Betaf |. ' 4 2 ’ 2 ), and = 1 if d = 2, while cr t is a Rademacher random variable, i.e. it takes 
value —1 with probability 1 and value 1 with probability In the last line we define B\ t = ^ 2 -w 2 6 2 , 
B 2 t = ~ and B 3t = £ z 2a t b t c t uj t yju t - w 2 ^. 

The first equivalence follows from writing Ex = Y^t= i ^t.XtxJ^t and grouping the projections 
instead with the u vectors. The second equivalence is just an expansion of the squared term. For the 
third equivalence, notice that xJW £ K d_1 while at £ K d_1 is distributed uniformly on the unit sphere. 
We can think of at as a one-dimensional projection operator, and by the geometric argument from before, 
we know that the squared norm of the projection is distributed as a Betaf '-^r) random variable, scaled 
by the squared-length of the original vector xJW. We use the same argument for the third term, except 
we introduce the Rademacher random variable because the xJWat is symmetric about zero. The fourth 
equivalence follows from the fact that WW T = I — uii 1 and therefore ||a;^kF || 2 = xfWW T Xt = HaJtHi - 
{xju) 2 . 

Now consider all of the u t random variables fixed, and we will develop a deviation bound for the remain¬ 
ing randomness. We will apply Bernstein’s inequality, so we need to bound the variance and the range. By 
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Fact fill we know that: 


dX 


Var (B lt ) = Xr ( E ^ - ( Ew ?) 2 ) = 


mx 


Var (B 2t ) = 


4 t 

i “ 


m(m + 2 )(m + 4)(m + 6) 
m 4 l d(d + 2) (d + 4) (d + 6) 


m(m + 2) 
d(d + 2) 


(to + 4) (to + 6) 0 (to + 4) 

+ (d + 4)(d + 6) (d + 4) 


Var(f? 3t ) = 4 


d A b 2 c 2 to(to + 2)(to + 4) 
to 4 ^ e?(d+2)(d + 4) 



= V 3t 


Therefore, 


Var (Bit + B 2 t + B 3t ) < 3(Vi t + V 2 1 + V 31 )- 

As for the range, by straightforward calculation, we have: 

d 2 ( l 2 _ 

|Bit + i? 2 t + B 3t \ < —o &t + 7 Ct^t + -XtCty/vt 

mr \ 4 3 

The last term is actually maximized when u) t = 3/4 and takes value 3\/3/8 < 2/3. Bernstein’s inequality 
now reveals that with probability at least 1 — 5: 


T,\u — E^ 


Ei it 


< 


6 log(2/5) 


\ 


1 ” 

- y^(vit ■ 

71 Z ^ 


T , , 2 d 2 log(2/5) 

V 2t + V 3t ) H---^—- max 

6nm s te[n] 


h + 4 C t I/ * + g btCty/vt 


Next we obtain a bound on 


E a 


t u T Eiw T — u t YiU 


the needed result. The expectation here is: 


, and combine these via the triange inequality to prove 


E, 


ut,»t 


u T t lU = — 


d(r 


2 )r 2 


n m{d 


2) ht 


d(d — to) 2 
m(d + 2) * 4 


So in expectation over a;*, cr t , by substituting in for u T YiU, the left hand side of the application of Bernstein’s 
inequality is: 


E aJtj(Tt zi T Ei'u T — u t Yiu 


1 d(d — to) 2 d(d — to) 2 

n m(d + 2) 4 4 m(d + 2)(d—1) 4 


(13) 


Note that if d = 2, this quantity is identically zero. We are left to control all of the terms involving the v t 
random variables for d > 3. By Proposition [25] we have that for any 5j > 0, provided that log/1 / 8\) < 

n d(d+5) 2 ^ n (d—2)(d+5)~ 

12 (d-l) 2 (d+l) — 4 (d-l) 2 (d+l)’ u,cu - 


P 



1 

d-1 


> 


24(d — 2)c 4 log(l/5i) \ 
n(d — l) 2 (d + 1) J 


< 5 -, 


where c = max te r„i c t . 
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By Proposition [26] we have that for any 6 > 0, 


6C 4 lQg(l/<$) \ r 

(■ d 2 - l)n - ’ 


P 



1 

d- 1 


c 


2 

t 


1 

n 


X! > 

£=1 




These two bounds control the deviation in the right hand side of Equation[l3] Finally, by Proposition[27]we 
have that for any 5 > 0 and d > 4, 


P fmaxz/ t > ^log(n/(5) ) < 6 
\te[n] d ' J 

which is also trivially true for d = 3 and d = 2. This last bound allows us to control the V^t, V 3 1 terms. 
Specifically, we have: 


d 4 b 4 f m(m + 2)(m + 4)(m + 6) /m(?n + 2)\ 2 \ A , 

u ' to 4 ^ d(d + 2)(d + 4)(d + 6) V d(d + 2) ) )~ 1 

d 4 c 4 641og 2 (n/d) / m(m + 2) / (m + 4)(m + 6) 9 (to + 4)\ /m(d — to)\ 2 \ A . 

— to 4 d 2 y d(d + 2) \ (d + 4)(d + 6) (d + 4) J \ d(d + 2) J J 2 

< 4d 4 6 2 c 2 8log(n/<5) m(m + 2)(to + 4) / m + 6\ A , 

3t — m 4 d d(d + 2)(d + 4) \ d + 6y 3 

where b = max (6 u It also controls the terms cf v t in the range term of our application of Bernstein’s 
inequality. Combining all of the bounds gives: 


u t Yi\u — u 1 YiU 


< 


6 log(2 / <5) 


y/V{ + V 2 + V%+ 


+ 


+ 


2d 2 log(2/d) 
3 nm 2 

d(d — to) I 
m(d + 2 ) 1 \ 


1 9 8 


8 


& + -j;c - log(n/d) + -bc\j - log (n/6) ) + 
^ 12 log(l/<5i) 2(d — 2)c 4 


/ 6c 4 log(l/<5) \ 
(d-l) 2 (d+l) + Y (d 2 — l)n J 


The second term on the right hand side can be upper bounded by: 

2d 2 log(2 /S) A | [ iog(n/8) \ 

3 nm 2 l V d I 

While the third term, by setting <5 = di can be upper bounded by: 

8d(d — to) /c 4 log(l/<5) 8d /c 4 log(l/<5) 

?n(d + 2) y n(d 2 — 1) — to V nd? 
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We have bounds for V{, V.[ and Vi: 


b 4 d 4 m(m + 2) 
m 4 d(d + 2) 

128c 4 d 4 log(n/<5) m(m + 2) 
m 4 d 2 d(d + 2) 

32 d 4 b 2 c 2 log(n/fe) m(m + 2) 
m 4 d d(d + 2) 

Here we use the fact that m < d so terms of the form 


v;< 


v,; < 


V'< 


tj -.—r < 1. This means that for 8 < n/e we have: 

(a+x) — — ' 


\JV[ + V2 + V3 < 


I m(m + 2) 


< 


3 d 2 


log (n/S) fe¬ 


rn 2 y d{d + 2) 
16c 2 ' 


log(n/<5) fe 


16c 2 


Putting everything together proves the claim. 


B.3 Proof of Lemma H6l 


We first derive the bound for Y\ , which is the simplest of the three. Notice that EYi = {JsLlo 2 )xx t . 

E(Yi - EYi) p = E (w 2 - Ecu 2 ) 2 ||s|| 2 &’- 1 >a:a; T 

We now proceed to bound the central moments of the random variable w 2 . Notice that since oj ~ Betaf 2 ^. 4 ,J n ) 


: know the non-central moments by Fact[TT] We also have the bound: 

for 1 > 0 


m + 2 i „ to . „, 

—r—— < 2 —(* + l) 
d -(- 2z d 


This can be seen by. 


d to + 2 i 
to d + 2 i 


1 + 2 i/m 
1 + 2 i/d 


< 1 + 2 i/m < 2{i + 1), 


for i > 0, d > 0, to > 1. 

Now to control the central moment of oj 2 , we apply Minkowski’s inequality: 

|E(w 2 - E(w 2 )) p | < E|w 2 - Euj 2 \ p < ((Ew 2p ) 1 / p + Etu 2 ) P 
( / 2 y-j- 1 m + 2i\ m(m + 2 ) \ 

y y “+ d + 2i J d(d + 2) J 

Now notice that since m < d the term ™+ 2 ? < 1 for all i but also the expression is monotonically increasing 
with i. This means that we can bound: 


?b(to + 2) (to + 2 i)(m + 2 (i + 1)) 
d(d + 2) — (d + 2 i)(d + 2(i + 1)) 
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This bound implies that the second term above is always smaller than the first, which leads to: 


2p—1 


|E(w 2 — E(w 2 )) p | < 2 P Yl 


i =o 


m + 2 i 
d + 2i 


= 2 P 


< 2 P 


= 2 P 


< 2 P 


2p— 1 

m(m + 2)(m + 4) -j-i- m + 2i 


d(d + 2 )(d + 4) 


n 

i=3 
P+1 


d + 2?' 


m(m + 2)(m + 4) y-|- to + 2* 


d(d + 2)(d + 4) 


n 


to(to + 2) (to + 4) Vt m + 6 + 2i 


d(d + 2 )(d + 4) 


i—3 
P-2 

n 

i=0 


d T 2i 


d T 6 T 2i 


ra(ra + 2)(m + 4) / m + 6 


+ 2)(d + 4) \ d + 6 
m\P~ 2 


p -1 


(P- !)! 


£ (28^) 

2 V d/ 


Tfl 

<^(28^) 4 x 28 x 3x5-^- 


The first inequality is based on the argument above, that the second term in the application of Minkowski’s 
inequality can be dominated by the first term. The second inequality follows since p>2so2p—l>p+l 
and the fact that the terms of the form are at most 1. The third inequality follows from the bound 
derived above on terms of the form ■ The last line follows from the fact that ^±4 < ^ (i + 1) applied 
to all terms of that form and the bound (p — 1)! <p\/2. 

Putting things together, we have: 


E(Yi 


EYi) p ^ 


p! 

2 

p! 

2 


( 28 ^) P 1 680^ r ||a;||2 (p l) xx T 

( 28 ^IMI i) P 1680 ^\\x\\lxx T , 


which proves the first claim. 

For the claim involving Y 2 , notice first that for a unit vector it, we can exploit orthogonality to write: 

(auu T + bl) p = ((a + b)uu T + b(I — uu T )) p = (a + b) p uu T + b p (I — uu T ). 

We will apply this identity on the term involving Y 2 . Notice also that, 

o o WW T 

Ey 2 = (E W - W 2 )||x|| 2 ^ T . 
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Since W T W = Id- 1 and since a and uj are independent and ||cr ||2 = 1, 


E (Y 2 - EF 2 ) P = E ((w - u; 2 )\\x\\ 2 Waa T W T - E(w - w 2 )||a;|| 


,WW 

d- 1 

rT 


T\P 


= \\x\\ 2 2 p WE — uj 2 )aa T — E(cu — uj 2 ) W 

2 E (uj — cj 2 )\ p T /— E(w — w 2 )^ p 


= \\x\\l P WE 


= \\x\\?WE 


UJ — UJ — 


d- 1 


aa 


d- 1 


(Id-i - «a T ) 




d-1 / d-1 


fi-E ij j _ 1 


d- 1 


= \\x\\o P WW t E 


uj — UJ — 


E(w — uj 2 ) 


2 \\p 


1 


— E(w — uj 2 ) 


2\\ P 


d- 1 


1 - 


W 


1 


d—1 / d—1 \ d—1 / \ d—1 

As before, we now use Minkowski’s inequality to bound the term involving the Beta random variables. 

,2 i\p 1 


E 


2 E(w — uj 2 ) 
UJ — UJ — 


d- 1 


d- 1 


-E( W -cu 2 )V 1 


< 


1 


< 


< 


d- 1 

1 

1 

d^T 


-E 


2 E(w — uj 2 ) 


2t\ Pi 


d — 1 y V d — 1 

E(w — w 2 ) x p 


d- 1 


2x P \Vp , E(w-w 2 ) 


2M P 


(E(w - w 2 ) p ) /P + 


+ 


(E(w p )) 1/p ■ 


Ecu 


d- 1 

i p / Ew ^ p 


E(u> — uj 2 ) 
d-1 


2 \\p 


d-1 


d-1 


Here the second line is based on an application of the triangle inequality, while the third line follows from 
Minkowski’s inequality on the first term. In the fourth line, we use the bound E(w — uj 2 ) < E(w) on all 
terms, which is valid since uj £ [0,1], As in the bound for Y\, we now use the fact that: 


d 


^m + 2i\' lr 


m m + i . , . „ to / -r-r 

-T- < - — v z > 0 => — < TT , „ 

1 +1 d l d + 2z 

\ 2=0 


Applying this bound to both terms involving Ecu = ^ gives. 


1 


d-1 


(E(w p )) 1/p + 


Ecu 


d-1 


1 


< 


d-1 
1 

d~ 


0^ m + 2i \' / ’ 


Ecu 

d-1 


m 


\i—0 


d + 2i 


d(d — 1) 


TO 


d(d — 1) 


(1 


1 p \ TT TO + 2i 


d- 


n 

7 2—0 


d + 2i 


< 


2 p + 1 P T-r m + 2z 


in 


d— 1 d-\-2i 

2 = 0 
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We now use the same upper bounds as we did to control Y-\, 


2 P + 1 
d- 1 


p-i 

n 


i =0 


m + 2 i 
d 2i 


2 P + 1 m(m + 2) y-j 2 m + 2 + 2i 
d- 1 d(d + 2) H d + 2 + 2 i 

i—1 


< 2 P + 1 m(m + 2) / m + 2 
— d—1 d(d + 2) \ d + 2 





p -2 

(P- !)! 


Combining this with the derivation above gives: 


E(F 2 - E Y 2 ) p ± ^ (l2^)" 2 48 ^||^||^WW t 



which, along with the fact that / — uu T < I , gives the bound for Y 2 . 

Finally for Y i: note that E Y :i = 0 which is clear because W is an orthonormal basis for the subspace 
orthogonal to x. This for odd p, we have EYg* = 0 while for even p, 

EKf = E (uVu}-lo 2 Y \\x\\ p 2 E(xa T W T + Wax T ) p 

= E (Ww-w 2 ) P ||at||§ {\\x\\ p f 1 xx T + IMIf^y) 

= E (u\/uj- w 2 ) ll^H^ -1 (xx T + \\ x h W d W x j 


The only non-trivial step here is the second one, where we note that ( xa T W T + Wax T ) 2 = (xx T + 
||at|| 2 VIEct;Q: :r TT rT ’) by direct calculation and exploiting orthogonality of x and W. By further exploiting 
orthogonality this expression to any natural power is equal to taking each term to that power and this gives 
the expression in the second line. 

Now for the term involving u>, since u> £ [0,1] and p is even. 


E (uy/w^u*) p = E(w 3 / 2 v / T^) p < Ew 3p/2 = E* = o 3 ^ 2 - 1 "^^ 

a + 2z 

= m(m + 2)(w + 4) = 3p/2 _i m + 2i, 

d(d + 2)(d + 4) d + 2i 

^ to 3 3j jy 3 to + 4 + 2i to 3 y? to + 4 + 2 i 

~ ° d 3 A1 d + 4 +2i ~ d 3 d + 4 +2i 
1=1 1=1 


< 15 


TO 3 

h 3 


f o m + 4 
V d + 4 


p! 

(P~ !) ! < j 


m\p ~ 2 




15 


TO 3 

l 3 


This derivation uses all of the same steps as in the previous two cases. The only thing to note is that we use 
the bound ‘ip/2 — 3 > p — 2 which holds as long as p > 2. Combining this with above gives: 


Elf E 


P[ 

2 



1r TO I, ii 2 

15 ^r 11*112 



2 ww T 

d 


+ XX 
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The last step is to use the fact that WW T = I — fpw. This proves the lemma. 

II ^112 

B.4 Proof of Lemma [18] 

We will prove Lemma [l8| for the distributions based on A r (0. ///) and A r (0. ?// + jeief). By rotational 
invariance, the bound holds if we replace e\ with any unit vector v. The KL-divergence for a single sample 
is: 


KL{ Pi,P 0 ) 


A/"(0, U t 'E 1 U) Unif((7) log 


/ A/”(0, U T EiU) Unif((7)\ 
\^(0, C/ T S 0 C/) Unif(Ef )) 


Eu^ Vni{ KL{U(0, U T ^U)M{0, U T Z 0 U)) 


1 /I 


Ec/^unif;: - tr (r)I m + lU T e^U) - m - log 


det(7y I m + jU T eiej U) 

TjUl 


Here Unif is the uniform distribution over orthonormal bases for m-dimensional subspaces of M. d . To 
analyze the quantity inside the expectation, let Ai,..., A m denote the eigenvalues of r)I m + r yU T e\ e[U 
and write: 


1 l 


det (rjlm + jU T eieJU) 


7 : ( - + jU T eiei U) - m - log 

2 \r] ,/ 

1 / m \ i / m 

= d H A *h - log(A i/rj) - 1 < - - 


K i= 1 


vi=1 


= 5 E 1 «V») - A i \ B V? - !) ! = - sot'll 2 , 


2 tr < Vi i 2 


2—1 


(V??) 2 - 1 

(Ar/t?) 2 + 1 

1 

2rf‘ 


- 1 


2?y 2 


U T e\e^ U\\ 2 F . 


The first inequality above uses the inequality log(a;) > x 2 — 1/ ( x 2 + 1) which holds for a: > 1. The second 
inequality is that x 2 +1 < 1 since x 2 — x + 1 is convex and minimized at a; = 1/2 in which case it takes 
value 3/4. So we have show than: 


KL( Pr.Po) < 



Et/~Unif||C^ r eief U\\ 


2 

F 


We will now upper bound this expectation. 


Ef/^Unif \\U T eiei U\\ 2 F = ^ Eu-^Unif^ijtAy 

*J=1 

This is the squared-Frobenius norm of the outer product of the first row (in R m ) with itself. Marginally, 
each entry of U, after squaring, is distributed as Beta(|, A-A) so the diagonal terms of this matrix (the 
terms where i = j above) are just the second (non-central) moment of the Beta distribution. These are 

3 < _ 3 _ 

d(d+ 2) — d 2 ’ 

For the off-diagonal terms, note that by spherical symmetry each row of U has a direction that is chosen 
uniform at random (in m dimensions) while the squared-norm of each row is distributed as Beta( . rl ., m ). 
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This second fact holds because UU T e i = ivei + y/u}(l — ui)z where z + e \ and u> ~ Beta(y, ^ Dl ), so 
that eJUU T e i = ||£/ T ei||| = u. If we let v £ K m denote a uniform at random unit vector, the off-diagonal 
terms can be written as: 


E w 2 v 2 v 2 = EvfvjE uj 2 < 


fe./j ~Zrnl2rnl2 + l 
1 V 3 d/2 d/2 + 1 


3 to to + 2 3 

m(m + 2) d d + 2 ~ d 2 


Here we use the Cauchy-Schwarz inequality, the fact that oj ~ Beta(^, d 2 " ) and that marginally each 
Vj ~ Beta(^, "‘Z/ 1 ) since v is a uniform random vector in m-dimensions. So every term in the sum is 
bounded by 3/d 2 . There are m 2 terms producing the bound: 


E^unif||t/ T eiefC/||| < 

Plugging this into our KL bound above and using additivity of KL-divergence for product measures, we 
arrive at the bound in the Lemma. 

B.5 Proof of Lemma [19] 

Recall that the quantity we are interested in lower bounding is: 

infE^n^HT^nx) - a;||| 

The expectation over T allows for randomized estimators, x is drawn uniformly at random from the //-radius 
sphere, and n is a uniformly drawn m dimensional projection operator. Instead of drawing an m-dimensional 
projection matrix n uniformly at random, it is equivalent to draw an orthonormal basis U £ R' /Xr " uniformly 
at random. The observation is then (U, U T x) which is clearly equivalent to observing (n, 11 ) since one can 
be constructed from the other and vice-versa. So we will instead lower bound: 

inf E x , Ut t\\T(U, U t x) -s||l = inf E XiUtT \\T(U, U T x)\\\ - 2T{U, U T x) T x + r, 2 

Before proceeding, we need to clarify one definition. We will use Z V. y = Z(I\/y. y) to denote the angle 
between the subspace V and the vector y. We can evaluate the integrals by first choosing a subspace U, 
choosing a vector y £ U, and finally choosing the vector x so that Pfjx = y. 

miE x ^ T \\T(U,Hx)-x\\ 2 2 >mf [ I I [E t ||T(J7, U t x)\\ 2 2 - 2T{U, U T x) T x + r/ 2 ] dP(x; y, U)dQ(y)dR(U) 
T T JU Jy Jx 


where P is the conditional distribution of x given that it projects to y with subspace U, Q is the distribution 
over projection vectors y and R is the uniform distribution on the m-dimensional Grassmannian manifold. 
Now we will push the inf-p inside of the first two integrals. Notice that all of the information the estimator 
has is U, y since U T x = U T y, so for each (U, y) pair, the estimator is just a distribution over vectors. Call 
this distribution T(-;y,U ): 

infEs,n,r||2\II,na;) - x\\l > [ [ inf [ f ||u||| - 2v T x + ri 2 dP{x\ y, U)dT(v; y, U)dQ{y)dR(U) 
1 JU Jy T V>y’ U ) Jv Jx 
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The only term depending on x is the v T x term, for which: 


/ 2v T xd,P(x\ y, U ) = 2v T E x ^p( x . y ' U) x = 2 v T i 

J X 


which follows by spherical symmetry, since we draw x uniformly from the //-radius sphere, which is sym¬ 
metric about the subspace U. Therefore we have: 


inf 

T(-,y,U) 


(INI 


2 

2 


2v T x + rj 1 ) dP(x\ y, U)dT{v : y, U ) 



= V 


2 


b\\l 


2v T y + if) dT(v;y, U) 


So we can lower bound by: 


infE Ki n,r||r(n,IIx) - x\\% > y 2 - [ [ \\y\\ 2 2 dQ(y)dR{U) 

T Ju Jy 

and are left to control the expected norm of y = Pjjx. We have the identity || j/|| = ||x|| cos Z(Pjjx, x) = 
Tj cos Z(Pjjx, x). By spherical symmetry, we can let U to be the span of the first m standard basis vectors, 
in which case: 


COS ZPjjX, x 




Therefore ||y||| = rfZ = ?/ 2 x ' 2 where xi,. 

Z^i=i x i 


Xd ~ Af( 0,1). This gives the lower bound: 


infE^n.Tlimilx) - x|| 2 > t/ 2 (1 - E [Z]) = t? 2 (l - j). 
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